ABSTRACT

Location based services (LBS) have become very popular in recent
years. They range from map services (e.g., Google Maps) that store
geographic locations of points of interest, to online social
networks (e.g., WeChat, Sina Weibo, Foursquare) that leverage user
geographic locations to enable various recommendation functions.
The public query interfaces of these services may be abstractly
modeled as a kNN interface over a database of two-dimensional
points on a plane: given an arbitrary query point, the system
returns the k points in the database that are nearest to the query
point. In this paper we consider the problem of obtaining
approximate estimates of SUM and COUNT aggregates by only querying
such databases via their restrictive public interfaces. We
distinguish between interfaces that return location information of
the returned tuples (e.g., Google Maps), and interfaces that do not
return location information (e.g., Sina Weibo). For both types of
interfaces, we develop aggregate estimation algorithms that are
based on novel techniques for precisely computing or approximately
estimating the Voronoi cell of tuples. We discuss a comprehensive
set of real-world experiments for testing our algorithms, including
experiments on Google Maps, WeChat, and Sina Weibo.

1. INTRODUCTION 

1.1 LBS with a kNN Interface

Location based services (LBS) have become very popular in recent
years. They range from map services (e.g., Google Maps) that store
geographic locations of points of interest (POIs), to online social
networks (e.g., WeChat, Sina Weibo, Foursquare) that leverage user
geographic locations to enable various recommendation functions.
The underlying data model of these services may be viewed as a
database of tuples that are either POIs (in the case of map
services) or users (in the case of social networks), along with
their geographical coordinates (e.g., latitude and longitude) on a
plane.

However, third-party applications and/or end users do not have
complete and direct access to this entire database. The database is
essentially “hidden”, and access is typically limited to a restricted
public web query interface or API by which one can specify an
arbitrary location as a query, which returns at most k nearest tuples
to the query point (where k is typically a small number such as
10 or 50). For example, in Google Maps it is possible to specify an
arbitrary location and get the ten nearest Starbucks. Thus, the query
interfaces of these services may be abstractly modeled as a “nearest
neighbor” kNN interface over a database of two-dimensional points
on a plane: given an arbitrary query point, the system returns the k
points in the database that are nearest to the query point.

In addition, there are important differences among the services
based on the type of information that is returned along with the
k tuples. Some services (e.g., Google Maps) return the locations
(i.e., the x and y coordinates) of the k returned tuples. We refer to
such services as Location-Returned LBS (LR-LBS). Other services
(e.g., WeChat, Sina Weibo) return a ranked list of k nearest tuples,
but suppress the location of each tuple, returning only the tuple ID
and perhaps some other attributes (such as tuple name). We refer
to such services as Location-Not-Returned LBS (LNR-LBS).
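
The two interface types can be mimicked with a small in-memory stand-in. The sketch below is only an illustration under stated assumptions: the class name `ToyKnnInterface`, its fields, and the flat Euclidean plane are all hypothetical and do not correspond to any real service's API.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


class ToyKnnInterface:
    """Hypothetical stand-in for a hidden LBS database behind a top-k interface."""

    def __init__(self, tuples: Dict[str, Point], k: int, return_locations: bool):
        self._tuples = tuples                  # tuple ID -> (x, y); hidden from callers
        self.k = k
        self.return_locations = return_locations
        self.queries_issued = 0                # running query cost

    def query(self, q: Point) -> List[Tuple[str, Optional[Point]]]:
        """Return the k tuples nearest to q, ranked by Euclidean distance."""
        self.queries_issued += 1
        ranked = sorted(self._tuples, key=lambda tid: math.dist(q, self._tuples[tid]))
        if self.return_locations:              # LR-LBS: IDs plus coordinates
            return [(tid, self._tuples[tid]) for tid in ranked[: self.k]]
        return [(tid, None) for tid in ranked[: self.k]]   # LNR-LBS: ranked IDs only


db = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)}
lr = ToyKnnInterface(db, k=2, return_locations=True)
lnr = ToyKnnInterface(db, k=2, return_locations=False)
print(lr.query((0.1, 0.0)))    # two nearest tuples with coordinates
print(lnr.query((0.9, 0.0)))   # two nearest tuples, locations suppressed
```

In both modes the caller only ever sees ranked results for a chosen query point, which is the sole access path assumed throughout the paper.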

Both types of services impose additional querying limitations,
the most important being a per user/IP limit on the number of queries
one can issue over a given time frame (e.g., by default, the Google
Maps API imposes a query rate limit of 10,000 per user per day).

1.2 Aggregate Estimations 

For many interesting third-party applications, it is important to
collect aggregate statistics over the tuples contained in such
hidden databases, such as sum, count, or distributions of the tuples
satisfying certain selection conditions. For example, a hotel
recommendation application would like to know the average review
scores for Marriott vs. Hilton hotels in Google Maps; a cafe chain
startup would like to know the number of Starbucks restaurants in a
certain geographical region; a demographics researcher may wish
to know the gender ratio of users of social networks in China, etc.

Of course, such aggregate information can be obtained by entering
into data sharing agreements with the location-based service
providers, but this approach can often be extremely expensive, and
sometimes impossible if the data owners are unwilling to share their
data. Therefore, in this paper we consider the problem of obtaining
approximate estimates of such aggregates by only querying the
database via its restrictive public interface. Our goal is to minimize
query cost (i.e., ask as few queries as possible) in an effort to
adhere to the rate limits imposed by the interface, and yet make the
aggregate estimations as accurate as possible.

The closest prior work is [10]. This approach is based on
generating random point queries, estimating the area of the Voronoi
cell [15] of the returned top-1 tuple for each query, and estimating
the aggregate from these top-1 tuples by making corrections for
sampling bias using the area of the Voronoi cell. However, there
are several limitations of this work. First, this approach works only
for LR-LBS, but does not work for LNR-LBS, and is thus inapplicable
over a large variety of location based services such as WeChat
and Sina Weibo that do not return precise location or distance
information. Second, the approximate nature of the technique used
for estimating the area of a Voronoi cell makes the overall
aggregate estimation inherently biased. Third, the method only uses
the top-1 returned tuple for each query in its calculations (the
remaining k - 1 tuples are ignored), thus leading to inefficiency in
the estimation procedure. We discuss this and other related work
later in the paper.

1.3 Outline of Technical Results 

Results over LR-LBS Interfaces: We first describe our results
over LR-LBS interfaces. Like [10], our approach is also based on
generating random point queries and computing the area of Voronoi
cells of the returned tuples, but a key differentiator is that our
approach provides an efficient yet precise computation of the area
of Voronoi cells. This procedure is fundamentally different from
the approximate procedure used in [10] for estimating the area of
Voronoi cells, and is one of the significant contributions of our
paper. This leads to unbiased estimations of SUM and COUNT
aggregates over LR-LBS interfaces; in contrast, the estimations
in [10] have inherent (and unknown) sampling bias.

Moreover, we also leverage the top-k returned tuples of a query
(and not just the top-1) by generalizing to the concept of a top-k
Voronoi cell, leading to significantly more efficient aggregate
estimation algorithms. We also developed four different techniques
for reducing the estimation variance (and thereby estimation error)
over LR-LBS interfaces: faster initialization, leveraging history,
variance reduction through dynamic selection of query results, and a
Monte Carlo method which leverages current knowledge of upper/lower
bounds on the Voronoi cell without sacrificing the unbiasedness of
estimations.

We combine the above ideas to produce Algorithm LR-LBS-AGG, a
completely unbiased estimator for COUNT and SUM queries with or
without selection conditions. We note that AVG queries can be
computed as SUM/COUNT.

Results over LNR-LBS Interfaces: We also consider the problem
of aggregate estimations over LNR-LBS interfaces. To the best of
our knowledge, this is a novel problem with no prior work. Recall
that this type of kNN interface only returns a ranked list of the
top-k tuples in response to a point query, and location information
for these tuples is suppressed. None of the prior work for LR-LBS
interfaces can be extended to LNR-LBS interfaces. For such
interfaces, we develop aggregate estimation algorithms that are not
completely bias-free, but can guarantee an arbitrarily small
sampling bias. The key idea here is the inference of a tuple’s
Voronoi cell to an arbitrary precision level with a small number of
queries from merely the ranks of the returned tuples.

On a related note, we also show how one can infer the position
of a tuple in LNR-LBS, again at a level of arbitrary precision - a
problem that, while of independent interest, is also critical for
enabling the estimations of aggregates that feature selection
conditions on tuples’ locations (e.g., the COUNT of social network
users within 10 meters of major highways). We also study a subtle
extension to cases where k > 1; in particular, we study the
challenge brought to this case by the (possibly) concave nature of
top-k Voronoi cells, and develop an efficient algorithm to detect
potential concavity and guarantee the accuracy of the inferred
Voronoi cell.

We combine the above ideas to produce Algorithm LNR-LBS-AGG, an
estimator for COUNT and SUM queries with or without selection
conditions. Unlike Algorithm LR-LBS-AGG, this estimator may be
biased, but the bias can be controlled to any arbitrary desired
precision. As before, we note that AVG queries can be computed as
SUM/COUNT.


1.4 Summary of Contributions 

• Location based services have become very popular in recent
years, and aggregate estimation over such “hidden” databases
with their restricted kNN query interfaces is an important
problem with numerous applications. In our work, we consider
both LR-LBS (locations returned) as well as the more
novel LNR-LBS (locations not returned) interfaces.

• For LR-LBS interfaces, we develop Algorithm LR-LBS-AGG
for estimating COUNT and SUM aggregates with or without
selection conditions. It represents a significant improvement
over prior work along multiple dimensions: a novel way of
precisely calculating Voronoi cells leads to completely
unbiased estimations; top-k returned tuples are leveraged rather
than only top-1; and several innovative techniques are developed
for reducing error and increasing efficiency.

• For LNR-LBS interfaces, we develop Algorithm LNR-LBS-AGG
for estimating COUNT and SUM aggregates with or
without selection conditions. This is a novel problem with no
prior work. The estimated aggregates are not bias-free, but
the sampling bias can be controlled to any desired precision.
Among several key ideas, we show how a Voronoi cell can
be inferred to an arbitrary degree of precision from merely
the ranks of returned tuples to point queries.

• Our contributions also include a comprehensive set of
real-world experiments. Specifically, we conducted online tests
over a number of real-world LBS, e.g., Google Maps (LR-LBS)
for estimating the number of Starbucks in the US (and
compared the results with the ground truth published by
Starbucks); WeChat and Sina Weibo for estimating the percentage
of male/female users in China.

2. BACKGROUND 
2.1 Model of LBS 

A location based service (LBS) supports kNN queries over a
database D of tuples with location information. These tuples can
be points of interest (e.g., Google Maps) or users (e.g., WeChat,
Sina Weibo). A kNN query q takes as input a location (e.g., a
longitude/latitude combination), and returns the top-k nearest tuples
selected and ranked according to a pre-determined ranking function.
Since the only input to a query is a location, we use q to also
denote the query’s location without introducing ambiguity. Most of
the popular LBS follow the kNN query model. For most parts of the
paper, we consider the ranking function to be Euclidean distance
between the query location and each tuple’s location. Extensions
to other ranking functions are discussed in §5.3.

Note that tuples in an LBS system contain not only location
information but also many other attributes - e.g., a POI in Google
Maps includes attributes such as POI name, average review ratings,
etc. Depending on which attributes of a tuple are returned by the
kNN interface - more specifically, whether the location of a tuple
is returned - we can classify LBS into two main categories:

LR-LBS: A Location-Returned-LBS (LR-LBS) returns the precise
location for each of the top-k returned tuples, along with possibly
other attributes. Google Maps, Bing Maps, Yahoo! Maps, etc., are
prominent examples of LR-LBS, as all of them display the precise
location of each returned POI. Note that some LBS may return a
variant of the precise locations - e.g., Skout and Momo return not
the precise location of a tuple, but the precise distance between
the query location and the tuple location. We consider such LBS
to be in the LR-LBS category because, through previously studied
techniques such as trilateration (e.g., [18]), one can infer the
precise location of a tuple with just 3 queries.
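
To make the three-query claim concrete, the following sketch recovers a hidden point from its distances to three non-collinear query locations. It assumes a flat Euclidean plane and exact distances; the function name `trilaterate` and the sample coordinates are hypothetical, and this is not necessarily the technique used in [18].

```python
import math


def trilaterate(q1, q2, q3, d1, d2, d3):
    """Recover (x, y) from distances d1..d3 to three non-collinear query points.

    Subtracting the circle equation around q1 from those around q2 and q3
    cancels the quadratic terms and leaves a 2x2 linear system in x and y.
    """
    (x1, y1), (x2, y2), (x3, y3) = q1, q2, q3
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1 ** 2 - d2 ** 2 + x2 ** 2 - x1 ** 2 + y2 ** 2 - y1 ** 2
    b2 = d1 ** 2 - d3 ** 2 + x3 ** 2 - x1 ** 2 + y3 ** 2 - y1 ** 2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("query points must not be collinear")
    # Cramer's rule for the 2x2 system.
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


hidden = (3.0, 4.0)                                   # unknown to the querier
queries = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]      # three query locations
distances = [math.dist(q, hidden) for q in queries]   # what the LBS would report
recovered = trilaterate(*queries, *distances)
print(recovered)
```

Real services report distances on the Earth's surface and often round them, so a practical version would need a projection step and some tolerance for noise.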

LNR-LBS: A Location-Not-Returned-LBS (LNR-LBS), on the other
hand, does not return tuple locations - i.e., only other attributes
such as name, review rating, etc., are returned. This category is
more prevalent among location based social networks (presumably
because of potential privacy concerns about precise user locations).
Examples here include WeChat, which returns attributes such as
name, gender, etc., for each of the top-k users, but not the precise
location/distance. Other examples include Sina Weibo, etc., which
feature similar query return semantics.

Common Interface Features and Limitations: Generally speaking,
there are two ways through which an LBS (either LR- or LNR-LBS)
supports a kNN query. One is an interactive web or API
interface which allows a location to be explicitly specified as a
latitude/longitude pair. Google Maps is an example of this. Another
common way is for the LBS (e.g., as a mobile app) to directly
retrieve the query location from a positioning service (such as GPS,
WiFi or cell towers) and automatically issue a kNN query
accordingly. In the second case, there is no explicit mechanism to
enter the location information. Nonetheless, it is important to note
that, even in this case, we have the ability to issue a query from
any arbitrary location without having to physically travel to that
location. All mobile OS have debugging features that allow an
arbitrary location to be used as the output of the positioning
(e.g., GPS) service.

Many LBS also impose certain interface restrictions. One is the
aforementioned top-k restriction (i.e., only the k nearest tuples are
returned). Another common one is a query rate limit - i.e., many
LBS limit the maximum number of kNN queries one can issue per
unit of time. For example, by default Google Maps allows 10,000
location queries per day while Sina Weibo allows only 150 queries
per hour. This is an important constraint for our purpose because it
makes the query-issuing process the bottleneck for aggregate
estimation. To understand why, note that even with the generous limit
provided by Google Maps, one can issue only 7 queries per minute
- this 8.6 second per query overhead¹ is orders of magnitude higher
than any offline processing overhead that may be required by the
aggregate estimation algorithm. Thus, this interface limitation
essentially makes query cost the No. 1 performance metric to
optimize for aggregate estimation. An LBS might impose other, more
subtle constraints - e.g., a maximum coverage limit which forbids
tuples far away (say more than 5 miles away) from a query location
to be returned. We discuss these constraints in §5.3.

2.2 Voronoi Cells 

The Voronoi cell is a key geometric concept used extensively by
the algorithms developed in this paper. Thus, we introduce this
concept here as part of the preliminaries. Consider each tuple
$t \in D$ as a point on a Euclidean plane bounded by a box B (which
covers all tuples in D). We have the following definition.

Definition 1 (Voronoi Cell). Given a tuple $t \in D$, the
Voronoi cell of t, denoted by V(t), is the set of points on the
B-bounded plane that are closer to t than any other tuple in D.

Note that the B-bound ensures that each Voronoi cell is a finite
region. The Voronoi cells of different tuples are mutually exclusive
- i.e., the Voronoi diagram is the subdivision of the plane into
regions, each corresponding to all query locations that would return
a certain tuple as the nearest neighbor².

¹ Of course, one can shorten it with multiple IP addresses and API accounts - but
similarly, one can use parallel processing to speed up offline processing as well.

² We assume general positioning [15] - i.e., no two tuples have the exact same location
and no four points lie on the same circle.


For the purposes of our paper, we define an extension of the
Voronoi cell concept to accommodate the top-k (when k > 1)
query return semantics. Specifically, given a tuple $t \in D$, we
define the top-k Voronoi cell of t, denoted by $V_k(t)$, as the set of
query locations on the plane that return t as one of the top-k
results. There are four important observations about this concept:

First, the top-k Voronoi cells for different tuples are no longer
mutually exclusive. Each location l belongs to exactly k top-k
Voronoi cells, corresponding to the top-k tuples returned by a query
over l. Second, our concept of top-k Voronoi cells is fundamentally
different from the k-th order Voronoi cells in geometry [15]
- each of which is formed by points with the exact same k closest
tuples. The difference is illustrated in Figure 1 - while each colored
region is a k-th order Voronoi cell, a top-k Voronoi cell may
cover multiple regions with different colors. For example, the top-2
Voronoi cell for tuple A is marked by a red border and consists of
two different k-th order Voronoi cells (AB and AE).


Figure 1: Concavity of top-k Voronoi Diagrams

Third, while both top-1 Voronoi cells and k-th order Voronoi
cells are guaranteed to be convex [15], the same does not hold for
top-k Voronoi cells when k > 1. For example, from Figure 1 we
can see that the aforementioned top-2 Voronoi cell for tuple A is
concave. Fourth, a top-k Voronoi cell tends to contain many more
edges than a top-1 Voronoi cell. As we shall discuss later in the
paper, the larger number of edges and the potential concavity
make computing the top-k Voronoi cell of a tuple t more difficult.
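
The first and third observations can be checked on a tiny configuration. The sketch below uses three made-up tuples (A, B, C, not those of Figure 1) and tests top-2 cell membership by brute force; the helper names are hypothetical.

```python
import math

# Three illustrative tuples; these coordinates are assumptions for the demo.
points = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (0.0, 2.0)}


def top_k(q, k):
    """Names of the k tuples nearest to query location q."""
    return sorted(points, key=lambda name: math.dist(q, points[name]))[:k]


def in_top_k_cell(q, name, k):
    """True if q lies in the top-k Voronoi cell of the named tuple."""
    return name in top_k(q, k)


def cells_containing(q, k):
    """All tuples whose top-k Voronoi cell contains q."""
    return [name for name in points if in_top_k_cell(q, name, k)]


p1, p2 = (3.0, 0.5), (0.5, 3.0)
midpoint = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)

# Each location belongs to exactly k top-k cells.
print(cells_containing(p1, 2))

# Both endpoints lie in A's top-2 cell but their midpoint does not,
# so the cell cannot be convex.
print(in_top_k_cell(p1, "A", 2), in_top_k_cell(p2, "A", 2), in_top_k_cell(midpoint, "A", 2))
```

Here A's top-2 cell is the complement of the quadrant where both B and C are closer than A, which gives an L-shaped (concave) region.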

2.3 Problem Definition 

In this paper, we address the problem of aggregate estimations
over LBS. Specifically, we consider aggregate queries of the form
SELECT AGGR(t) FROM D WHERE Cond, where AGGR is an aggregate
function such as SUM, COUNT and AVG that can be evaluated
over one or more attributes of t, and Cond is the selection
condition. Examples include the COUNT of users in WeChat or the
AVG rating of restaurants in Texas on Google Maps.

There are two important notes regarding the selection condition
Cond. First, we support any selection condition that can be
independently evaluated over a single tuple - i.e., it is possible to
determine whether a tuple t satisfies Cond based on nothing but t.
Second, for both LR- and LNR-LBS, we support the specification
of a tuple’s location as part of Cond - even when such a location
is not returned, as in LNR-LBS. This is possible thanks to what
we shall discuss in §4.3 - i.e., even with LNR-LBS, one can derive
the location of a tuple to arbitrary precision after issuing a small
number of queries. As such, we support aggregates such as the
percentage of female WeChat users in Washington, DC.

In most of the technical sections, we focus on aggregates
without selection conditions - the straightforward extensions to
various types of selection conditions will be discussed in §5.

Figure 2: Illustration of Theorem 1

Performance Measures: The performance of an aggregate
estimation algorithm is measured in terms of efficiency and accuracy.
Given the query-rate limit enforced by all LBS, the efficiency is
measured by query cost - i.e., the number of queries and/or API
calls that the algorithm issues to the LBS. Often, we are given a
fixed budget (based on the rate limits) and hence designing an
efficient algorithm that generates accurate estimates within the
budgetary constraints is crucial. The accuracy of an estimation
$\hat{\theta}$ of an aggregate $\theta$ could be measured by the
standard measure of relative error $|\hat{\theta} - \theta| / \theta$.
Note that, for any sampling-based approach (like ours), the relative
error is determined by two factors: bias, i.e.,
$|E(\hat{\theta} - \theta)|$, and variance of $\hat{\theta}$. The mean
squared error (MSE) of the estimation is computed as
$\mathrm{MSE} = \mathrm{bias}^2 + \mathrm{variance}$.

An interesting question that often arises in practice is how we can
determine the relative error achieved by our estimation. If the
population variance is known, then one can apply standard statistics
techniques to compute the confidence interval of aggregate
estimations. Absent such knowledge, a common practice is to
approximate the population variance with the sample variance, which
can be computed from the samples we use to generate the final
estimation, and to use Bessel’s correction to correct the result.
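
As a minimal sketch of this practice, the snippet below computes a relative error, a Bessel-corrected sample variance, and a normal-approximation confidence interval for the mean of per-sample estimates. The sample values and the multiplier z = 1.96 (a 95% interval under a normal approximation) are assumptions made for the illustration.

```python
import math
import statistics


def relative_error(estimate, truth):
    """|estimate - truth| / truth."""
    return abs(estimate - truth) / truth


def mean_with_interval(samples, z=1.96):
    """Sample mean and a z-based confidence interval around it.

    statistics.variance divides by n - 1, i.e. it already applies
    Bessel's correction to the sample variance.
    """
    mean = statistics.mean(samples)
    sample_var = statistics.variance(samples)
    half_width = z * math.sqrt(sample_var / len(samples))
    return mean, (mean - half_width, mean + half_width)


per_sample_estimates = [90, 110, 100, 105, 95]   # made-up per-query estimates
mean, interval = mean_with_interval(per_sample_estimates)
print(mean, interval)
print(relative_error(110, 100))
```

With few samples the normal approximation is rough; a t-distribution multiplier would be the more careful choice.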


3. LR-LBS-AGG

In this section, we develop LR-LBS-AGG, our algorithm for
generating unbiased SUM and COUNT estimations over an LR-LBS
query interface. Specifically, we start by introducing our
key idea of precisely computing the (top-k) Voronoi cell of a given
tuple, which enables the unbiased aggregate estimations. While
this idea guarantees unbiasedness, it may require a large number of
queries per (randomized) estimation, leading to a large estimation
variance (and therefore, error) when the query budget is limited.
Hence we develop four techniques for reducing the estimation
error while maintaining the complete unbiasedness of aggregate
estimations. Finally, we combine all ideas to produce Algorithm
LR-LBS-AGG at the end of this section.

3.1 Key Idea: Precisely Compute Voronoi Cells

Reduction to Computing Voronoi Cells: We start by describing
a baseline design which illustrates why the problem of aggregate
estimations over an LBS’s kNN interface ultimately boils down to
computing the volume of the Voronoi cell corresponding to a tuple
t. As an example, consider the estimation of COUNT(*) (over a
given region) through an LR-LBS with a top-1 interface.

We start by choosing a location q uniformly at random from the
region, and then issue a query at q. Let t be the tuple returned by
q. Suppose that we can compute the Voronoi cell of t (as defined in
§2.2), say V(t). A key observation here is that the sampling
probability of t, i.e., the probability for the above-described
randomized process to return t, is exactly $p(t) = |V(t)| / |V_0|$,
where $|V(t)|$ and $|V_0|$ are the volumes of V(t) and the entire
region, respectively. Note that knowledge of p(t) directly leads to
a completely unbiased estimation of COUNT(*): $r = 1/p(t)$, because

$$\mathrm{Exp}(r) = \sum_{t \in D} p(t) \cdot \frac{1}{p(t)} = |D|, \qquad (1)$$

where Exp(·) is the expected value of the estimation (taken over
the randomness of the estimation process), and |D| is the total
number of tuples in the database. From (1) one can see that every SUM
and COUNT aggregate we support can be estimated without bias -
the only change required is on the numerator of the estimation.
Instead of having 1 as in the COUNT(*) case, it should be the
evaluation of the aggregate over t - e.g., if we need to estimate
SUM($A_i$), where $A_i$ is an attribute, then the numerator should be
$t[A_i]$, i.e., the value of $A_i$ for t. If the aggregate is COUNT
with a selection condition, then the numerator should be either 1 if
t satisfies the condition, or 0 if it does not. One can see from the
above discussion that, essentially, the problem of enabling unbiased
SUM and COUNT estimations is reduced to that of precisely computing
the volume of V(t), i.e., the Voronoi cell of a given tuple t.

Computing Voronoi Cells: For computing the Voronoi cell of a
given tuple, a nice feature of the LR-LBS interface is that it
returns the precise location of every returned tuple. Clearly, if we
can somehow “collect” all tuples with Voronoi cells adjacent to that
of t, then we can precisely compute the Voronoi cell of t based on
the locations of these tuples (and t). As such, the key challenges
here become: (1) how do we collect these tuples and (2) how do we
know if/when we have collected all tuples with Voronoi cells
adjacent to that of t? Both challenges are addressed by the following
theorem, which forms the foundation of the design of Algorithm
LR-LBS-AGG.

THEOREM 1. Given a tuple $t \in D$ and a subset of tuples
$D' \subseteq D$ such that $t \in D'$, the Voronoi cell of t defined
according to D', represented by P', is the same as that according to
the entire dataset D, denoted by P, if and only if for all vertices v
of P', all tuples returned by the nearest neighbor query issued at v
over D belong to D'.

PROOF. First, note that there must be $P \subseteq P'$, because for a
given location q, if there is already a tuple t' in D' that is closer
to q than t, then there must be at least one tuple in D that is closer
to q than t. Second, if $P \neq P'$ (i.e., $P \subset P'$), then there
must be at least one vertex of P', say v, that falls outside P, i.e.,
there must exist a tuple $t_0 \in (D \setminus D')$ that is closer to
v than all tuples in D'. □

Example 1: Figure 2 provides an illustration of Theorem 1. In
order to compute the Voronoi cell of the tuple corresponding to the
red dot, it suffices to know the locations of the adjacent tuples.
Since each Voronoi edge is a perpendicular bisector between the
adjacent tuples, the entire Voronoi cell can be computed as the
convex shape induced by the intersections of the edges.
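
One way to realize this construction is to clip the bounding box against the perpendicular-bisector half-plane of each neighbor in turn. The sketch below does so with a standard convex-polygon clipping step; the function names and the sample coordinates are hypothetical, and the paper does not prescribe this particular routine.

```python
import math


def clip_closer_to(poly, t, s):
    """Clip convex polygon `poly` to the points at least as close to t as to s."""
    nx, ny = s[0] - t[0], s[1] - t[1]
    c = (s[0] ** 2 + s[1] ** 2 - t[0] ** 2 - t[1] ** 2) / 2.0
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]
        f_prev = prev[0] * nx + prev[1] * ny - c   # <= 0 means "closer to t"
        f_cur = cur[0] * nx + cur[1] * ny - c
        if (f_prev <= 0) != (f_cur <= 0):          # edge crosses the bisector
            w = f_prev / (f_prev - f_cur)
            out.append((prev[0] + w * (cur[0] - prev[0]),
                        prev[1] + w * (cur[1] - prev[1])))
        if f_cur <= 0:
            out.append(cur)
    return out


def polygon_area(poly):
    """Shoelace formula for a simple polygon given as a vertex list."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return abs(sum(a[0] * b[1] - b[0] * a[1] for a, b in pairs)) / 2.0


def voronoi_cell(t, neighbours, box):
    """Cell of t with respect to `neighbours`, bounded by the convex polygon `box`."""
    cell = list(box)
    for s in neighbours:
        cell = clip_closer_to(cell, t, s)
    return cell


box = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
t = (5.0, 5.0)
neighbours = [(9.0, 5.0), (1.0, 5.0), (5.0, 9.0), (5.0, 1.0)]
cell = voronoi_cell(t, neighbours, box)
print(cell, polygon_area(cell))
```

The result is exact only if `neighbours` contains every tuple whose cell is adjacent to that of t, which is precisely the condition Theorem 1 lets one verify through queries.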

Theorem 1 answers both challenges outlined above: it tells us
when we have collected all “adjacent” tuples - when all vertices of
t’s Voronoi cell computed from the collected tuples return only
collected tuples. It also tells us how to collect more “adjacent”
tuples when not all of them have been collected - any vertex which
fails the test naturally returns some tuples that have not been
collected yet, adding to our collection and starting the next round
of tests.

According to the theorem, a simple algorithm for constructing
the exact Voronoi cell for t is as follows: We start with D' = {t}.
Now the Voronoi cell is the entire region (say an extremely large
rectangle). We issue queries corresponding to its four vertices. If
any query returns a point we have not seen yet - i.e., not in D' - we
append it to D', recompute the Voronoi cell, and repeat the process.
Otherwise, if all queries corresponding to vertices of the Voronoi
cell return points in D', we have obtained the real Voronoi cell for
$t \in D$. One can see that the query complexity of this algorithm is
O(n), where n is the number of points in the database D, because
each query issued either confirms a vertex of the final Voronoi cell
(which has at most n - 1 vertices), or returns a new point we
have never seen before (there are at most n - 1 of these too). It is
easy to see that the bound is tight - as one can always construct a
Voronoi cell that has n - 1 edges and therefore requires Ω(n) top-1
queries to discover (after all, each such query returns only 1 tuple).
An example here is when t is in the center of a circle, on which the
other n - 1 points are located. Algorithm 1 shows the pseudocode
of the baseline approach, which we improve in Section 3.2.

Algorithm 1 LR-LBS-AGG-Baseline
while query budget is not exhausted
    q = location chosen uniformly at random; t = query(q)
    V(t) = V_0; D' = {t}
    repeat till D' does not change between iterations
        for each vertex v of V(t): D' = D' ∪ query(v)
        Update V(t) from D'
Produce aggregate estimation using samples
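
The baseline can be exercised end to end against a simulated top-1 oracle. The following is a sketch under several assumptions: the hidden points, the axis-aligned bounding box, and all function names are made up, a fixed sample count stands in for the query budget, and the cell is rebuilt by half-plane clipping rather than by any routine specified in the paper.

```python
import math
import random


def clip_closer_to(poly, t, s):
    """Clip convex polygon `poly` to the points at least as close to t as to s."""
    nx, ny = s[0] - t[0], s[1] - t[1]
    c = (s[0] ** 2 + s[1] ** 2 - t[0] ** 2 - t[1] ** 2) / 2.0
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]
        f_prev = prev[0] * nx + prev[1] * ny - c   # <= 0 means "closer to t"
        f_cur = cur[0] * nx + cur[1] * ny - c
        if (f_prev <= 0) != (f_cur <= 0):          # edge crosses the bisector
            w = f_prev / (f_prev - f_cur)
            out.append((prev[0] + w * (cur[0] - prev[0]),
                        prev[1] + w * (cur[1] - prev[1])))
        if f_cur <= 0:
            out.append(cur)
    return out


def area(poly):
    """Shoelace formula."""
    return abs(sum(a[0] * b[1] - b[0] * a[1] for a, b in zip(poly, poly[1:] + poly[:1]))) / 2.0


def exact_cell(t, query, box):
    """Inner loop of the baseline: grow D' until no vertex returns an unseen tuple."""
    known = {t}                                    # plays the role of D'
    while True:
        cell = list(box)
        for s in known - {t}:
            cell = clip_closer_to(cell, t, s)
        unseen = {query(v) for v in cell} - known  # one top-1 query per vertex
        if not unseen:                             # Theorem 1: the cell is exact
            return cell
        known |= unseen


def estimate_count(points, box, samples, rng):
    """Average of 1 / p(t) over uniformly random query locations."""
    xs, ys = [p[0] for p in box], [p[1] for p in box]
    issued = 0

    def query(q):                                  # simulated top-1 LR-LBS oracle
        nonlocal issued
        issued += 1
        return min(points, key=lambda p: math.dist(q, p))

    estimates = []
    for _ in range(samples):
        q = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        t = query(q)
        estimates.append(area(box) / area(exact_cell(t, query, box)))   # 1 / p(t)
    return sum(estimates) / samples, issued


BOX = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
POINTS = [(5.0, 5.0), (9.0, 5.0), (1.0, 5.0), (5.0, 9.0), (5.0, 1.0)]
estimate, cost = estimate_count(POINTS, BOX, samples=200, rng=random.Random(0))
print(estimate, cost)
```

The returned query count makes the cost of the initialization phase visible, which is the bottleneck the error reduction techniques of Section 3.2 target.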


Example 2: Figure 3 provides a simple run-through of the
algorithm for a dataset with 5 tuples $\{t_1, \ldots, t_5\}$. Suppose
we wish to compute $V(t_4)$. Initially, we set $D' = \{t_4\}$ and
$V(t_4) = V_0$, the entire bounding box. We issue query $q_1$ that
returns tuple $t_5$ and hence $D' = \{t_4, t_5\}$. We now obtain a new
Voronoi edge that is the perpendicular bisector between $t_4$ and
$t_5$. The Voronoi cell after step 1 is highlighted in light grey. In
step 2, we issue query $q_2$ that returns $t_4$, resulting in no
update. In step 3, we issue query $q_3$ that returns $t_3$. Now
$D' = \{t_3, t_4, t_5\}$ and we obtain a new Voronoi edge as the
perpendicular bisector between $t_3$ and $t_4$, depicted in medium
grey. In step 4, we issue query $q_4$ that returns $t_2$, resulting
in the final Voronoi edge depicted in dark grey. Further queries over
the vertices of $V(t_4)$ do not result in new tuples, concluding the
invocation of the algorithm.

Extension to k > 1: Interestingly, no change is required to the
above algorithm when we consider the top-k Voronoi cell rather
than the traditional, i.e., top-1 Voronoi cell. To understand why,
note that Theorem 1 directly extends to top-k Voronoi cells - as a
top-k Voronoi cell computed from D' still must completely cover that
for D; and any vertex of the top-k Voronoi cell from D' which is
outside that from D must return at least one tuple outside D'. We
further describe how to leverage k > 1 in Sections 3.2.3 and 4.2.

3.2 Error Reduction 

Before describing the various error reduction techniques we
develop for aggregate estimations over LR-LBS, we would like to
first note that, while we use the term “error reduction” as the title
of this subsection, some of the techniques described below in fact
focus on making the computation of a Voronoi cell more efficient.
The reason why we call all of them “error reduction” is because
of the inherent relationship between efficiency and estimation error
- if the Voronoi-cell computation becomes more efficient, then we
can do so for more samples, leading to a larger sample size and
ultimately, a lower estimation error (which is inversely proportional
to the square root of the sample size).

3.2.1 Faster Initialization

A key observation from the design in §3.1 is its bottleneck: the
initialization process. At the beginning, we know nothing about
the database other than (1) the location of tuple t, and (2) a large
bounding box corresponding to the area of interest for the aggregate
query. Naturally, D' = {t}, leading to the initial Voronoi cell being
the bounding box, and our first four queries being the corners of
this bounding box. Of course, the tentative Voronoi cell will
quickly close in on the real one with speed close to a binary search -
i.e., the average-case query cost is logarithmic in the bounding box
size. Nonetheless, the initialization process can still be very costly,
especially when the bounding box is large.

To address this problem, we develop a faster initialization
technique which features a simple idea: Instead of starting with
D' = {t}, we insert four fake tuples into D', say
$D' = \{t, t^f_1, \ldots, t^f_4\}$, where $t^f_1, \ldots, t^f_4$ form
a bounding box around t. The size of the bounding box should be
conservatively large - even though a wrongly set size will not
jeopardize the accuracy of our computation - as we shall show next.

When we compute the initial Voronoi cell from D' and then issue
queries corresponding to its vertices, there are two possible
outcomes: One is that these queries return enough real tuples
(besides t, of course) that, after excluding the fake ones from D',
we still get a bounded Voronoi cell for t. One can see that, in this
case, we can simply continue the computation while having saved a
significant number of initialization queries. The other possible
outcome, however, is when the bounding box is set too small, and we
do not have enough real tuples to “bound” t with a real Voronoi cell.
Specifically, in the extreme-case scenario, all four vertices of the
initial Voronoi cell could return t itself. In this case, we simply
revert back to the original design, wasting nothing but four queries.

One can see that the faster initialization process still guarantees
the exact computation of a tuple’s Voronoi cell. It has the potential
to save a large number of initialization queries in the average-case
scenario, while in the worst case, it wastes at most four queries.
Algorithm 2 provides the pseudocode for the faster initialization
strategy.


Algorithm 2 Fast-Init

1: Input: t; Output: V(t)
2: D' = {t, t_1^F, t_2^F, t_3^F, t_4^F}; Update V(t) based on D'
3: If all queries over vertices of V(t) return t, then return V_0
4: repeat till D' does not change between iterations
5:   for each vertex v of V(t): D' = D' ∪ query(v)
6:   Update V(t) from D'
7: return V(t)


Example 3: Figures 4 and 5 show two scenarios in which the strategy
is successful and unsuccessful, respectively, depending on whether
the bounding box due to fake tuples is conservatively large. Given a
small dataset with tuples {t_1, ..., t_5}, we initialize them with a
bounding box corresponding to fake tuples {f_1, ..., f_4}. In
Figure 4, the initial bounding box is tight enough and results in the
computation of the precise V(t_4) with much lower query cost (i.e.,
only tuples {t_3, t_5} are visited, as against tuples {t_2, t_4, t_5}
for the example of Algorithm 1). On the other hand, if the bounding
box is not tight (as in Figure 5), then queries over all the vertices
of the bounding box return t_4. We then revert to the original
bounding box V_0 that covers the entire region.

3.2.2 Leverage history on Voronoi-cell computation 
Another natural optimization is to leverage the information gleaned
from computing the Voronoi cells of past tuples to compute a tighter
initial Voronoi cell. Recall that our algorithm to compute the
Voronoi cell of a tuple t (i.e., V(t)), using Theorem 1, starts with
an initial Voronoi cell that is an extremely large bounding box
covering the entire plane, which then converges to V(t). In the
process of computing this Voronoi cell, our algorithm retrieves
additional new tuples (by issuing queries for each vertex of the
bounding box). Notice that for an LBS with static tuples (such as
POIs in Google Maps), the results of a location query ordered by
distance remain static. Hence it is not necessary to restart every
iteration of the algorithm with the same large bounding box.
Specifically, when computing the Voronoi cell for the next tuple, we
could leverage history by starting with a “tighter” initial bounding
box derived from the set of tuples that we have seen so far. In other
words, we reuse the tuples that we have seen so far as input to
further rounds. Notice that this approach remains the same for both
k = 1 and k > 1. Since the location of each tuple in the top-k is
returned in LR-LBS, each of these tuples can be leveraged. As we see
more tuples, the initial Voronoi cell becomes more granular,
resulting in substantial savings in query cost. Algorithm 3 provides
the pseudocode for the strategy. While the pseudocode uses the simple
perpendicular bisector half-plane approach [15], it could also use
more sophisticated approaches such as Fortune’s algorithm to compute
the bounding box around tuple t using the tuples from historic
queries.


Algorithm 3 Leverage-History

1: Input: t and H (set of tuples obtained from historic queries)
2: Output: Bounding box V'(t)
3: V'(t) = V_0
4: for each tuple h ∈ H
5:   Update V'(t) with perpendicular bisector between h and t
6: return V'(t) with the tightest bounding box around t
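The half-plane update in Algorithm 3 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes V_0 is given as a convex polygon (a list of vertices in order), and the helper names `clip_to_closer_side`, `leverage_history`, and `polygon_area` are hypothetical. Each historic tuple h contributes one perpendicular bisector, and the polygon is clipped to the side closer to t.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def clip_to_closer_side(poly: Sequence[Point], t: Point, h: Point) -> List[Point]:
    """Keep the part of convex polygon `poly` that is at least as close to t as to h."""
    # x is on t's side of the bisector iff (h - t) . x <= (|h|^2 - |t|^2) / 2.
    nx, ny = h[0] - t[0], h[1] - t[1]
    c = (h[0] ** 2 + h[1] ** 2 - t[0] ** 2 - t[1] ** 2) / 2.0

    def side(p: Point) -> float:
        return nx * p[0] + ny * p[1] - c  # <= 0 means "on t's side"

    out: List[Point] = []
    n = len(poly)
    for i in range(n):
        cur, nxt = poly[i], poly[(i + 1) % n]
        s_cur, s_nxt = side(cur), side(nxt)
        if s_cur <= 0:
            out.append(cur)
        # The polygon edge crosses the bisector: add the crossing point.
        if (s_cur < 0 < s_nxt) or (s_nxt < 0 < s_cur):
            w = s_cur / (s_cur - s_nxt)
            out.append((cur[0] + w * (nxt[0] - cur[0]),
                        cur[1] + w * (nxt[1] - cur[1])))
    return out


def leverage_history(t: Point, history: Sequence[Point], v0: Sequence[Point]) -> List[Point]:
    """Tentative cell V'(t): V_0 clipped by the bisector of t and every historic tuple."""
    cell = list(v0)
    for h in history:
        if h != t:
            cell = clip_to_closer_side(cell, t, h)
    return cell


def polygon_area(poly: Sequence[Point]) -> float:
    """Shoelace formula for the area of a simple polygon."""
    n = len(poly)
    s = 0.0
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0
```

Because the result only uses tuples already seen, it is computed offline; it is an upper bound on V(t) that the online vertex queries then tighten.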


Example 4: As part of computing V(t_4) (see Example 1), we
have the locations of t_3, ..., t_5. Using this information, we can
compute the initial bounding box for t_2 (shown in red around t_2 in
Figure 6) offline - i.e., without issuing any queries.

3.2.3 Variance reduction with larger k

When the system has k > 1, we can of course still choose to use
the top-1 Voronoi cell as if only the top result is returned. Or we
can choose from any of the top-h Voronoi cells as long as h ≤ k.
While intuitively it might appear that using all k returned tuples is
definitely better than using just the top-1, the theoretical analysis
suggests otherwise - indeed, whether the top-1 or top-h Voronoi cell
is better depends on the exact aggregate being estimated -
specifically, whether the distribution of the attribute being
aggregated is better “aligned” with the size distribution of top-1 or
top-h Voronoi cells. To see why, simply consider an extreme-case
scenario where the aggregate being estimated is AVG(Salary), and the
salary of each user (tuple) is exactly proportional to the size of
its top-1 Voronoi cell. In this case, nothing beats using the top-1
Voronoi cells, as doing so produces zero variance and thus zero
estimation error.

Having said that, however, many aggregates can indeed be better
estimated using top-h Voronoi cells, because the sizes of these
top-h cells are more uniform than those of the top-1 cells, which can
vary extremely widely (see the corresponding figure in the
experiments section for an example), while many real-world aggregates
are also more uniformly distributed than the top-1 cell volume
(again, see the experiments for justification). But simply increasing
h also introduces an undesired consequence: recall from §2 that the
larger h is, the more “complex” the top-h Voronoi cell becomes - in
other words, the more queries we have to spend in order to pin down
the exact volume of the Voronoi cell.

Thus, the key is to make a proper tradeoff between the benefit
received (i.e., smaller variance per sample) and the cost incurred
(i.e., larger query cost per sample). Our main idea is a combination
of two methods: leveraging history in §3.2.2 and upper/lower bound
approximation in §3.2.4. Specifically, for each of the k returned
tuples, we perform the following process:

Consider t_i, returned as the No. i result. We need to decide which
version of the Voronoi cell definition to use for t_i. The answer can
be anywhere from 1 to k. To make the determination, for all
h ∈ [2, k], we compute λ_h(t_i), the upper bound on the volume of the
top-h Voronoi cell of t_i, as computed from all historically
retrieved tuples. Then, we choose the largest h which satisfies
λ_h(t_i) < λ_0, where λ_0 is a pre-determined threshold (the
intuitive meaning of which shall be elaborated next). Let the chosen
value be h(t_i). If none of h ∈ [2, k] satisfies the threshold, we
take h(t_i) = 1. Then, if h(t_i) ≥ i, we compute the top-h Voronoi
cell for t_i. The final estimation from the k returned results
becomes:

\[ \sum_{t_i : h(t_i) \ge i} \frac{Q(t_i)}{|V_h(t_i)|} \tag{2} \]

for any SUM or COUNT query Q, where |V_h(t_i)| is the volume
of the top-h Voronoi cell of t_i.

We now explain the intuition behind the above approach,
specifically the threshold λ_0. First, note that if the top-h (say
top-1) Voronoi cell of t_i is already large, then there is no need to
further increase h. The reason can be observed from the
above-described justification of variance reduction - note that a
large top-1 Voronoi cell translates to a large selection probability
p - i.e., a small 1/p which adds little to the overall variance.
Further increasing h not only contributes little to variance
reduction, but might actually increase the variance if 1/p is already
below the average value.




























Second, admittedly, λ_h(t_i) is only an upper-bound estimate -
i.e., even though we showed above that an already large top-h
Voronoi cell does not need to have h further increased, there remains
the possibility that λ_h(t_i) is large because of an overly loose
bound (from history), rather than the real volume of the Voronoi
cell. Nonetheless, note that this is still a negative signal for
using such a large h - as it means that we have not thoroughly
explored the neighborhood of t_i. In other words, we may need to
issue many queries in order to reduce our estimation (or computation)
of |V_h(t_i)| from λ_h(t_i) to the correct value. As such, we may
still want to avoid using such a large h in order to reduce the query
cost.

While the above explanation is heuristic in nature, it is important
to note that, regardless of how we set h(t_i), the estimation we
produce for SUM and COUNT aggregates in (2) is always unbiased.


Algorithm 4 Variance-Reduction

1: Input: H; Output: Aggregate estimate from all top-k tuples
2: q = location chosen uniformly at random
3: for each tuple t_i returned from query(q)
4:   h(t_i) = max{h | h ∈ [2, k], λ_h(t_i) < λ_0}
5:   h(t_i) = 1 if no h satisfied the condition λ_h(t_i) < λ_0
6:   Generate estimate for t_i using Equation (2)
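The selection of h(t_i) and the summation in Equation (2) can be sketched as below. This is an illustrative reading, not the authors' code: the function names are hypothetical, the upper bounds λ_h(t_i) are assumed to be supplied as a dictionary keyed by h, a result ranked i is assumed to contribute only when i ≤ h(t_i), and cell volumes are assumed to be expressed as a fraction of the whole region so that 1/|V_h(t_i)| plays the role of an inverse selection probability.

```python
from typing import Dict, Sequence


def choose_h(upper_bounds: Dict[int, float], k: int, lam0: float) -> int:
    """Largest h in [2, k] whose volume upper bound is below lam0; 1 if none qualifies."""
    qualifying = [h for h in range(2, k + 1) if upper_bounds[h] < lam0]
    return max(qualifying) if qualifying else 1


def estimate_from_answer(q_values: Sequence[float],
                         chosen_h: Sequence[int],
                         cell_volumes: Sequence[float]) -> float:
    """Sum Q(t_i) / |V_h(t_i)| over returned tuples whose rank i is at most h(t_i).

    The three sequences are aligned with the ranked query answer:
    index 0 is the No. 1 result, index 1 the No. 2 result, and so on.
    """
    total = 0.0
    for rank, (q, h, vol) in enumerate(zip(q_values, chosen_h, cell_volumes), start=1):
        if h >= rank:  # the query point lies inside the top-h cell of this tuple
            total += q / vol
    return total
```

For COUNT(*), every Q(t_i) is 1; for a SUM, Q(t_i) is the attribute value of t_i.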


3.2.4 Upper/lower bounds on Voronoi-cell 

Note that in the entire process of Voronoi-cell computation
(barring the very first step of the faster initialization idea
discussed in §3.2.1), we maintain a tentative polygon that covers the
entire real Voronoi cell - i.e., an upper bound on its volume. What
often arises in practice, especially when computing top-k Voronoi
cells (which tend to have many edges), is that even though the
bounding polygon is very close to the real Voronoi cell in volume, it
has far fewer edges - meaning we still need to issue many more
queries to pin down the exact Voronoi cell.

The key idea we develop here is to avoid such query costs without
sacrificing the accuracy of our aggregate estimations. Specifically,
consider a simple Monte Carlo approach which chooses uniformly at
random a point from the current bounding polygon, and then issues a
query from that point. If the query returns t - i.e., the point is in
the Voronoi cell of t - we stop. Otherwise, we repeat this process.
Interestingly, the number of trials it takes to reach a point that
returns t, say r, is an unbiased estimation of |V'(t)|/|V(t)|, where
|V'(t)| and |V(t)| are the volumes of the bounding polygon and the
real Voronoi cell of t, respectively:

\[ \mathrm{Exp}(r) = \frac{|V'(t)|}{|V(t)|}. \]

In other words, we can maintain the unbiasedness of our estimation
without issuing the many more queries required to pin down the exact
Voronoi cell. Instead, when V'(t) is close enough to V(t), we can
simply call upon the above-described method which, in all likelihood,
requires just one more query to produce an unbiased SUM or COUNT
estimation. For example, we can simply multiply the number of trials
r by |V_0|/|V'(t)|, where |V_0| is the volume of the entire region
under consideration, to produce an unbiased estimation for COUNT(*).
Other SUM and COUNT aggregates can be estimated without bias
analogously.
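The trial-counting idea can be checked numerically with a small simulation. The sketch below is illustrative only: the bounding polygon V'(t) is taken to be the unit square, the real cell V(t) its left half, and the predicate `in_cell` stands in for "the query at this point returns t"; all function names are hypothetical.

```python
import random
from typing import Callable, Tuple

Point = Tuple[float, float]


def trials_until_hit(sample_point: Callable[[random.Random], Point],
                     in_cell: Callable[[Point], bool],
                     rng: random.Random) -> int:
    """Number of uniform draws from V'(t) until a draw falls inside V(t)."""
    r = 0
    while True:
        r += 1
        if in_cell(sample_point(rng)):
            return r


def count_star_sample(r: int, v0_volume: float, vprime_volume: float) -> float:
    """COUNT(*) estimate from one sample: r * |V_0| / |V'(t)|."""
    return r * v0_volume / vprime_volume


def mean_trials(n: int, seed: int = 0) -> float:
    """Average trial count over n repetitions for the toy geometry described above."""
    rng = random.Random(seed)

    def sample_point(g: random.Random) -> Point:
        return (g.random(), g.random())  # uniform in the unit square V'(t)

    def in_cell(p: Point) -> bool:
        return p[0] < 0.5  # V(t) is the left half, so |V'(t)| / |V(t)| = 2

    return sum(trials_until_hit(sample_point, in_cell, rng) for _ in range(n)) / n
```

With this geometry the average returned by `mean_trials` should be close to 2, matching |V'(t)|/|V(t)|; when V'(t) is nearly tight, most samples stop after a single trial.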

Before concluding this idea, there is one more optimization we
can use here: a lower bound on the top-k Voronoi cell of t. In the
following, we first discuss how to use such a lower bound to further
reduce query cost, and then describe the idea for computing such
a lower bound. Note that once we have knowledge of a region R
that is covered entirely by the real (top-k) Voronoi cell, if in the
above process we randomly choose a point q (from V'(t)) which
happens to belong to R, then we no longer need to actually query
q - instead, we immediately know that q must belong to V(t) and
can produce an unbiased estimation accordingly. This is the cost
saving produced by knowledge of a lower bound R.

To understand how we construct this lower bound region, the key
observation is that, at any time during the execution of our
algorithm, we have tested certain vertices of V'(t) which are already
confirmed to be part of V(t). Consider such a vertex v. Let C(v, t)
be a circle with v being the center and the distance between t and
v being the radius. Note that we are guaranteed to have observed
all tuples within C(v, t). This essentially leads to a lower-bound
estimation of V(t). Specifically, a point q is in this lower-bound
region if and only if C(q, t), i.e., a circle centered on q with
radius being the distance between q and t, is entirely covered by the
union of C(v, t) for all vertices v of V'(t) that have been confirmed
to be within V(t). As such, for any q in this region, we can save the
query on it in the above process.

Example 5: The upper bound V'(t_4) of V(t_4) after Step 3 in
Example 2 (i.e., the run-through of Algorithm LR-LBS-AGG-Baseline)
is shown in Figure 7 as a quadrilateral with red edges. The three
lower vertices of V'(t_4) are guaranteed to be in V(t_4) using the
criteria described above, and hence the polygon induced by them
provides a lower bound estimate for V(t_4).

3.3 Algorithm LR-LBS-AGG 

By combining the baseline idea for precisely computing the Voronoi
cells with the four techniques for error reduction, we can design an
efficient algorithm LR-LBS-AGG for aggregate estimation over LR-LBS.
Algorithm 5 shows the pseudocode for LR-LBS-AGG.


Algorithm 5 LR-LBS-AGG

1: while query budget is not exhausted
2:   q = location chosen uniformly at random
3:   for each tuple t_i in query(q)
4:     Compute optimal h for t_i
5:     Construct initial V_h(t_i) using Algorithms 2 and 3
6:     D' = vertices of V_h(t_i)
7:     repeat till D' is not updated or Voronoi bound is tight
8:       for each vertex v of V_h(t_i): D' = D' ∪ query(v)
9:       Update V_h(t_i) and V'_h(t_i) from D'
10: Produce aggregate estimation using samples


4. LNR-LBS-AGG 

4.1 Voronoi Cell Computation: Key Idea 

We now consider the case where only a ranked order of points
is returned, but not their locations. We shall start with the case of
k = 1, and then extend to the general case of k > 1.

We start by defining a primitive operation of “binary search” as
follows. Consider the objective of finding the Voronoi cell of a
tuple t in the database. Given any locations c_1 and c_2 (not
necessarily occupied by any tuple), where c_1 returns t, consider the
half-line from c_1 passing through c_2. Since a Voronoi cell is
convex and c_1 resides within the Voronoi cell, this half-line has
one and only one intersection with the boundary of the Voronoi cell -
which is associated with one or two edges of the Voronoi cell. We
define the primitive operation of binary search for given c_1, c_2 to
be the binary search process of finding one Voronoi edge associated
with the intersection. Please refer to Appendix A for the detailed
design of this process.

















Naturally, such a binary search process is associated with an
error bound on the precision of the derived edge. For example, we
can set an upper bound ε on the maximum distance between any
point on the real Voronoi edge (i.e., a line segment) and its closest
point on the derived edge, which we refer to as the maximum edge
error, and use ε as the objective of the binary search operation. One
can see that the number of queries required for this binary search is
proportional to log(1/ε). See Appendix A for the exact query cost.

Given this definition, we now show that one can discover the
Voronoi cell of t (up to whatever precision level afforded to us by
the binary search operation) with a query complexity of
O(m log(1/ε)), where m is the number of edges of the Voronoi cell.
Here is the corresponding process:

We start with one query at point q which returns t. Then, we
construct 4 points that bound q (say q_1 : (x(q) - 1, y(q)),
q_2 : (x(q) + 1, y(q)), q_3 : (x(q), y(q) - 1),
q_4 : (x(q), y(q) + 1), where x(·) and y(·) are the two dimensions,
e.g., longitude and latitude, of a location, respectively) and call
upon the binary search operation to find the corresponding Voronoi
edges intercepting the half-lines from q to q_1, ..., q_4,
respectively. One can see that, no matter what the discovered edges
might be, they must form a closed polygon³ which we can use to
initiate the testing process described in §3.1. If all vertices pass
the test, then we have already obtained the Voronoi cell of t.
Otherwise, for each vertex (say v) that fails the test, we perform
the binary search operation on the location of v to discover another
Voronoi edge. We repeatedly do so until all vertices pass the test -
at which time we have obtained the real Voronoi cell - subject to
whatever error bound is specified for the binary search process (as
described above).

To compute the query cost of this process, a key observation is
that each call of the binary search process after the initial step
(i.e., a call caused by a vertex failing the test) increases the
number of discovered (real) edges of the Voronoi cell by 1. Thus, the
number of times we have to call the binary search process is O(m),
leading to the overall query-cost complexity of O(m log(1/ε)). For
the estimation error, we have the following theorem.


THEOREM 2. The estimation bias for COUNT(*) is at most 
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where d(t ) is the nearest distance between t and another tuple in 
D, and ε is the aforementioned maximum edge error.

Estimation bias for other aggregates can be derived accordingly
(given the distribution of the attribute being aggregated). One can
make two observations from the theorem: First, the smaller the
maximum edge error ε is or the larger the inter-tuple distance d(t)
is, the smaller the bias will be. Second, we can make the bias
arbitrarily small by shrinking ε - which leads to a log-scale
increase of the query cost.
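The log-scale cost of the binary search primitive can be illustrated with a small sketch. It is not the procedure of Appendix A: it only locates one boundary point on the segment from c_1 to c_2 (the paper's primitive goes on to derive a full edge), it assumes c_2 already lies outside the cell of the tuple returned at c_1, and the top-1 oracle is simulated locally from a hypothetical point set. The names `nearest_index` and `binary_search_boundary` are illustrative.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]


def nearest_index(points: Sequence[Point], q: Point) -> int:
    """Simulated rank-only top-1 oracle: index of the tuple nearest to q (location hidden)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))


def binary_search_boundary(oracle: Callable[[Point], int],
                           c1: Point, c2: Point, eps: float) -> Tuple[Point, int]:
    """Locate, within eps, where the segment c1 -> c2 leaves the cell of oracle(c1).

    Returns the estimated boundary point and the number of oracle queries issued.
    Assumes oracle(c2) differs from oracle(c1).
    """
    t = oracle(c1)
    if oracle(c2) == t:
        raise ValueError("c2 must lie outside the Voronoi cell of the tuple returned at c1")
    lo, hi = c1, c2  # invariant: lo returns t, hi does not
    queries = 2
    while math.dist(lo, hi) > eps:
        mid = ((lo[0] + hi[0]) / 2.0, (lo[1] + hi[1]) / 2.0)
        queries += 1
        if oracle(mid) == t:
            lo = mid
        else:
            hi = mid
    return ((lo[0] + hi[0]) / 2.0, (lo[1] + hi[1]) / 2.0), queries
```

Each iteration halves the uncertain interval, so for a segment of length L the loop issues about log2(L/ε) queries, which is the log(1/ε) dependence noted above.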

Algorithm 6 shows the pseudocode for LNR-LBS-AGG, which also
utilizes some of the error reduction ideas from §3.2.

Example 6: We consider the same dataset as Example 1, except
that in LNR-LBS the locations of tuples are not returned. Figure 8
shows a run-through of the algorithm by which one of the Voronoi
edges of V(t_4) is identified. Initially, the bounding box contains
the entire region, i.e., V_0. l_1 and l_2 are two lines starting from
t_4, constructed as per Algorithm 7. p_1 and p_2 are midpoints of
small line segments on l_1 and l_2 such that points on either side of
them return different tuples when queried. The new estimated Voronoi
edge is computed as the line segment connecting p_1 and p_2. Please
refer to Appendix A for further details.

³ In the extreme case, some edges of this polygon might be part of the bounding box.


Algorithm 6 LNR-LBS-AGG

1: while query budget is not exhausted
2:   q = location chosen uniformly at random; t = query(q)
3:   Construct four points q_1, ..., q_4 bounding t
4:   e_i = Binary-Search(q_i) ∀i ∈ [1, 4]
5:   V(t) = closed polygon from Voronoi edges e_1, ..., e_4
6:   D' = vertices of V(t)
7:   repeat till D' is not updated
8:     for each vertex v of V(t): D' = D' ∪ query(v)
9:     Find Voronoi edges ∀d ∈ D' and update V(t)
10: Produce aggregate estimation using samples


4.2 Extension to k > 1 

A complication brought by the rank-only return semantics is the
extension to cases with k > 1. Specifically, recall from §2 that the
(extended) top-k Voronoi cell might be concave when k > 1. In the
LR-LBS case, this does not cause any problem because, at any
moment, our derived top-k Voronoi cell is computed from the exact
tuple locations of all observed tuples and (therefore) completely
covers the real top-k Voronoi cell. For the LNR-LBS case, however,
this no longer holds: since we unveil the top-k Voronoi cell
edge after edge, if we happen to come across one of the “concave
edges” early, then we may settle on a sub-region of the real top-k
Voronoi cell. Figure 9 demonstrates an example of such a scenario.

Fortunately, there is an efficient fix to this situation. To
understand the fix, a key observation is that any “inward” (i.e.,
concave) vertex of a top-k Voronoi cell, say that of t, must be at a
position with equal distance to three tuples, one of them being t
(note: this might not hold for “outward” vertices). This property is
proved in the following lemma.

LEMMA 1. Any inward vertex of the top-k Voronoi cell of t
must be of equal distance to t and two other tuples in the database.

PROOF. Consider a partition of the entire region into base cells,
each of which returns a different combination of top-k tuples. One
can see that the top-k Voronoi cell of t must be the union of one or
more adjacent base cells. In addition, for general positioning (i.e.,
barring special positions such as bounding edges, etc.), any vertex
of the top-k Voronoi cell is formed by three edges (of some base
cells in the partition). Now consider the three edges which form
an inward vertex v, denoted by e_1, e_2, e_3. Note that, given v is
inward, one of the three edges must be inside the top-k Voronoi cell
of t. Let this edge be e_1. One can see that both e_2 and e_3
separate the top-k Voronoi cell from the outside - i.e.,
∀i ∈ {2, 3}, we have locations on one side of e_i returning t in the
top-k while locations on the other side do not. That is, each of e_2
and e_3 must be the perpendicular bisector of the line segment
connecting t and another tuple in the database. Let these two tuples
be t_2 and t_3 for e_2 and e_3, respectively. In other words, v must
have equal distance to t, t_2 and t_3. □

Given this property, the extension to k > 1 becomes
straightforward: Let D' be the set of all tuples we have observed
which appear along with t in the top-k result of a query answer. Let
t ∈ D'. First, note that if the polygon we output is not the top-k
Voronoi cell of t, then it must be a sub-region of it missing at
least one inward vertex. According to the above lemma, each inward
vertex is formed by two perpendicular bisectors, each between t and
another tuple. A key observation here is that at least one of the
missed inward vertices must be entirely formed by tuples in D'. The
reason is simple: if no missed inward vertex satisfies this property,
then we must have found the correct top-k Voronoi cell of t over D' -
i.e., what we get so far must be a super-region of the correct top-k
Voronoi cell of t over the entire database, contradicting our
previous conclusion that it is a sub-region.

Figure 9: Handling Concavity of top-k Voronoi Diagrams

Now our task is reduced to finding such a missing inward vertex.
Note that this is equivalent to finding the perpendicular bisector
of t and every other tuple in D' - as once these perpendicular
bisectors are identified, the rest is simply getting their
intersections, which can be done offline. For each tuple in D', we
either have already identified the perpendicular bisector through one
of the previous calls to the binary search process - or we can
initiate a new one as follows.

Specifically, to find the perpendicular bisector of t and t' ∈ D',
note that t' being in D' means that (1) at least one of the vertices
of the polygon we currently have must return t', and (2) at least one
of the vertices of the polygon we currently have must not return
t'. In other words, there must exist an edge of our current polygon
which has two vertices, one returning t' and the other not - i.e.,
this edge intersects the perpendicular bisector of t and t'. As such,
we simply need to run the binary search process over this edge to
find the perpendicular bisector, and then use it to update our
polygon. We repeat this process iteratively until we have enumerated
all perpendicular bisectors of t and other tuples in D' - at which
time we can conclude that there is no missing inward vertex. In other
words, we have found the top-k Voronoi cell of t. One can see that
the query complexity of this process remains at O(m log(1/ε)), as
every new binary search process called will return us a new edge of
the top-k Voronoi cell.

4.3 Tuple Position Computation 

Another important problem in the LNR-LBS case is the compu¬ 
tation of a tuple’s position, since such information is not returned 
in query answers as in the LR-LBS case. As discussed in the intro¬ 
duction, this problem can be of independent interest - it can also be 
called upon as a subroutine for aggregate query processing when 
the selection condition involves a tuple’s location. For example, 
one might be interested in the number of WeChat users within 20 
meters of major highways (i.e., those who are likely driving). To 
estimate this aggregate, we need to compute the location of a tuple 
(i.e., a WeChat user) in order to determine whether it satisfies the 
selection condition for the aggregate query. 

Once we compute the Voronoi cell for a tuple t, the computation
of t’s exact location takes only two additional calls to the binary
search process. The key idea of this computation is demonstrated
in Figure 10. The figure depicts one vertex of the top-1 Voronoi cell
of t_1. Let the vertex (at the center of the figure) be the origin
point o. The figure includes two edges of the Voronoi cell, d_1 and
d_3, corresponding to the perpendicular bisectors of (t_1, t_2) and
(t_1, t_3), respectively. Note that since o is of equal distance to
t_1, t_2, and t_3, it must be attached to a third edge which is part
of the Voronoi cells of t_2 and t_3 - this is depicted as d_2 in the
figure.

Figure 10: Demonstration of Tuple Position Computation

In the following, we describe the computation of t_1’s location in
three steps: First, we show that, with knowledge of d_1, d_2 and d_3,
one can readily compute the line from o to t_1 - i.e., the angle a in
the figure. Note that this indicates that, as long as one can do the
same for another vertex of the Voronoi cell (say o'), the location of
t_1 can be derived as the intersection of two lines: (o, t_1) and
(o', t_1). Of course, in practice we only know d_1 and d_3 from the
Voronoi cell computation, not d_2. Thus, we demonstrate in the second
step that deriving d_2 from d_1 and d_3 takes only a single call to
the binary search process.

First, to understand how angle a can be derived from d_1, d_2, d_3,
a key observation from Figure 10 is that the lines from o to any two
tuples must form equal angles to the Voronoi edge between them
- e.g., (o, t_1) and (o, t_3) must form equal angles to d_3. In other
words, the angle between (o, t_3) and d_3 is also a. Equipped with
this observation, it becomes obvious that:

\[ (a + b) + (b + c) + (c + a) = 2\pi \tag{4} \]

\[ \Rightarrow\; a + b + c = \pi \tag{5} \]

Since b + c is exactly the angle between d_1 and d_2, we can easily
compute a as π - (b + c). As such, we have computed the line from o
to t_1 based on knowledge of only d_1, d_2 and d_3.
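As a numerical check of Equations (4) and (5), the sketch below computes the direction of the ray from o to t_1 from the directions of the three edges alone. It is an illustration under stated assumptions: each edge is represented by the polar angle of its ray leaving o, every sector between two adjacent edges is assumed to be smaller than π, and the function name `direction_to_t1` is hypothetical.

```python
import math


def direction_to_t1(theta_d1: float, theta_d2: float, theta_d3: float) -> float:
    """Polar angle (in [0, 2*pi)) of the ray from vertex o towards t_1.

    theta_d1, theta_d3: directions of the two edges of t_1's cell leaving o
        (bisectors of (t_1, t_2) and (t_1, t_3), respectively).
    theta_d2: direction of the third edge (bisector of (t_2, t_3)).
    """
    two_pi = 2.0 * math.pi
    # b + c is the angle between d_1 and d_2, folded into [0, pi].
    diff = abs(theta_d1 - theta_d2) % two_pi
    b_plus_c = min(diff, two_pi - diff)
    a = math.pi - b_plus_c  # from a + b + c = pi
    # t_1 lies in the sector between d_3 and d_1, at angle a from d_3;
    # rotate from d_3 towards d_1 (counter-clockwise or clockwise).
    ccw_from_d3_to_d1 = (theta_d1 - theta_d3) % two_pi
    sign = 1.0 if ccw_from_d3_to_d1 < math.pi else -1.0
    return (theta_d3 + sign * a) % two_pi
```

For instance, with t_1 = (0, 0), t_2 = (4, 0), t_3 = (1, 3), the vertex is o = (2, 1) and the three edge rays point towards (0, -1), (0.5, 0.5) and (-1.5, 0.5) for d_1, d_2 and d_3; the function then recovers the direction of (-2, -1), i.e., the ray from o to t_1. Repeating this at a second vertex o' and intersecting the two rays yields the location of t_1.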

Now we explain how one can compute d_2 - the only one of the
three edges not part of the Voronoi cell of t_1 - with a single call
to the binary search process. Note from the fact that we have
computed both d_1 and d_3 that we must have issued at least one query
which returns t_2 as the top result, say q_2, and a query which
returns t_3 on top, say q_3. Obviously, d_2 intersects the line
segment between q_2 and q_3 exactly once. Thus, we simply need to
call the binary search process over (q_2, q_3) to derive d_2 and
enable the computation of t_1’s exact location. One can see that,
overall, the query complexity for computing both the Voronoi cell and
the location of a tuple remains O(m log(1/ε)), where m is the number
of edges of its Voronoi cell.

5. DISCUSSIONS 

5.1 Aggregates with Selection Conditions 

In most of the previous discussions, we considered aggregates 
without selection conditions (i.e., every tuple in the bounding re¬ 
gion is aggregated). There is indeed a straightforward extension 
to aggregates with selection conditions - specifically, there are two 
possible scenarios: 

The first is when the selection condition can be “passed through”
to the LBS. For example, if our goal is to COUNT “STARBUCKS”
within the bounding region, the selection condition
Name = ‘STARBUCKS’ can be passed through to the LBS - i.e., we simply
append to each query we issue the exact same selection condition as
the aggregate, Name = ‘STARBUCKS’. One can see that no other change
is required to the aggregate estimation process.

The other scenario is when the LBS does not support the selection
condition. For example, if we want to COUNT all businesses
with an average review score of at least four stars within the
bounding region, then we cannot simply pass this selection condition
to an LBS that does not support filtering by average review scores.
In this case, we simply need to “post-process” the selection
condition - e.g., for the above example, this means that after
randomly choosing a query and obtaining the returned tuple (as in
§3.1), we first determine if the tuple satisfies the filtering
condition. If so, we continue with the original process and return
the same estimation. Otherwise, we return 0 (i.e., the aggregate
query applied over the returned tuple, again divided by the sampling
probability) as the estimation. One can see that the result remains
an unbiased estimation for the aggregate, now with selection
conditions.
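The post-processing rule and its unbiasedness can be made concrete with a short sketch. The names below are hypothetical, tuples are represented as plain dictionaries, and the sampling probabilities p(t) are assumed to be known (in the paper they come from the Voronoi-cell volumes).

```python
from typing import Callable, Dict, Sequence


def post_processed_sample(q_value: float, prob: float, satisfies: bool) -> float:
    """Per-sample estimate: Q(t) / p(t) if t passes the filter, otherwise 0."""
    return q_value / prob if satisfies else 0.0


def expected_estimate(tuples: Sequence[Dict[str, float]],
                      probs: Sequence[float],
                      predicate: Callable[[Dict[str, float]], bool],
                      q: Callable[[Dict[str, float]], float]) -> float:
    """Exact expectation of the per-sample estimate when tuple t is drawn with probability p(t)."""
    return sum(p * post_processed_sample(q(t), p, predicate(t))
               for t, p in zip(tuples, probs))
```

For three businesses with review scores 4.5, 3.0 and 4.0 sampled with probabilities 0.5, 0.3 and 0.2, the expectation of the COUNT estimate under the filter "score at least 4" equals 2, the true count, regardless of how uneven the probabilities are.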

In the experiments, we shall demonstrate online tests over real- 
world LBS on aggregates with selection conditions in both cate¬ 
gories - e.g., COUNT of STARBUCKS over Google Maps, which 
can be passed through, and COUNT(restaurants) that are open on 
Sundays, which cannot. 

5.2 Leveraging External Knowledge 

In previous discussions, we focused on how to process the results
returned by a randomly chosen query (e.g., how to compute
the top-k Voronoi cell of a returned tuple). The way the initial
query is chosen, however, remains a simple design of choosing a
location uniformly at random from the bounding region. Admittedly,
without any knowledge of the distribution of tuple locations, the
uniform distribution appears to be the natural choice. Nonetheless,
in real-world applications, we often have certain a priori knowledge
of the tuple distributions, which we can leverage to optimize the
sampling distribution of queries.

For example, if our goal is to estimate an aggregate, say COUNT,
of Points of Interest (POIs, e.g., restaurants) in the US, a
reasonable assumption is that the density of POIs in a region tends
to be positively correlated with the region's population density.
Thus, we have two choices: either to sample a location uniformly at
random - which leads to POIs in rural areas being returned with a
much higher probability (because their Voronoi cells tend to be
larger); or to sample a location with probability proportional to its
population density - which hopefully leads to more-or-less uniform
selection probabilities over all POIs. Clearly, the second strategy
is likely better for COUNT estimation, as a more uniform selection
probability distribution directly leads to a smaller estimation
variance (and therefore error). For example, in the extreme-case
scenario where all POIs are selected with equal probability, our
COUNT estimation will be precise with zero error. Thus, an
optimization technique we adopt in this case is to design the initial
sampling distribution of queries according to the population density
information retrieved from external sources, e.g., US Census
data [1].

There are two important notes regarding this optimization. First,
whether or not the external knowledge is accurate, the COUNT
and SUM estimations we produce always remain unbiased. This
follows from the definition of our estimator, which guarantees
unbiasedness no matter what the sampling distribution p(t) is.
Second, the optimal sampling distribution depends on both the tuple
distribution and the aggregate query itself. For example, if we want
to estimate the SUM of review counts for all POIs, then the optimal
sampling distribution is to sample each tuple with probability
proportional to its review count (as this design produces zero
estimation variance and error). Given the difficulty of predicting
the aggregated attribute (e.g., the review count in this case) ahead
of time, leveraging external knowledge is better considered a
heuristic (a very effective one nonetheless, as we shall demonstrate
in the experimental results) rather than a practice that guarantees
a reduction of estimation error.
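
Both notes can be checked with a short calculation. The derivation below is a sketch that assumes the per-sample estimate has the form Q(t)/p(t), where Q(t) is the value tuple t contributes to the aggregate (Q(t) = 1 for COUNT), p(t) > 0 is the probability that a sample query selects t, and the p(t) sum to 1. The symbols Q, \hat{\theta}, and \theta are notation introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}[\hat{\theta}] &= \sum_{t} p(t)\,\frac{Q(t)}{p(t)} = \sum_{t} Q(t) = \theta, \\
\operatorname{Var}[\hat{\theta}] &= \sum_{t} \frac{Q(t)^2}{p(t)} - \theta^2 .
\end{align*}
% Sampling proportional to the aggregated value: p(t) = Q(t)/\theta with Q(t) > 0.
\[
\operatorname{Var}[\hat{\theta}] = \sum_{t} Q(t)\,\theta - \theta^2 = \theta^2 - \theta^2 = 0 .
\]
```

The first line holds for any p(t), which is why inaccurate external knowledge cannot introduce bias. The variance, in contrast, depends on how closely p(t) tracks Q(t).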

5.3 Special LBS Constraints 

We now consider two special constraints that are enforced by the
query interfaces of some real-world LBS. The first one is a maximum
radius on the returned results - i.e., the distance between
the query location q and the returned tuples is bounded by a
pre-determined threshold $d_{\max}$. If no tuple in the database
falls within the circle centered at q with radius $d_{\max}$, then
the query returns empty. Google Maps and Weibo both enforce this
constraint, with the threshold being 50 km [3] and 11 km⁴, respectively.

Interestingly, no change is required for our algorithms (both
LR-LBS-AGG and LNR-LBS-AGG) to handle this situation. One can
see that, as long as a query result is non-empty, the nearest
neighbor is always returned, enabling the usage of our algorithms.
When a query returns empty, we simply return 0 as the COUNT or SUM
estimation (for this sample query). The unbiasedness is unaffected:
by the definition of our estimator, unbiasedness is guaranteed
whether or not the sampling probabilities p(t) of all tuples sum up
to 1, as long as each tuple still has a positive probability of being
returned. With this constraint, we have $\sum_t p(t) < 1$, with the
remaining probability mass returning 0 - still leading to an
unbiased SUM or COUNT estimation.
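
The following sketch illustrates why returning 0 for empty answers preserves unbiasedness. It assumes a hypothetical discrete setting: a finite list of equally likely query locations on a line, a top-1 interface with maximum radius `d_max`, and a per-sample COUNT estimate of 1/p(t). None of the names below come from a real LBS API.

```python
from collections import Counter
from fractions import Fraction


def nearest_within(query, tuples, d_max):
    """Top-1 answer with a maximum radius: the nearest tuple, or None if
    no tuple lies within d_max of the query location."""
    best = min(tuples, key=lambda t: abs(t - query))
    return best if abs(best - query) <= d_max else None


def expected_count_estimate(queries, tuples, d_max):
    """Exact expectation of the per-query COUNT estimate when queries are
    drawn uniformly from `queries` and empty answers contribute 0.
    Also returns the total probability of a non-empty answer."""
    answers = [nearest_within(q, tuples, d_max) for q in queries]
    hits = Counter(a for a in answers if a is not None)
    # p(t): probability that a random query returns tuple t.
    p = {t: Fraction(n, len(queries)) for t, n in hits.items()}
    per_query = [0 if a is None else 1 / p[a] for a in answers]
    return sum(per_query) / len(queries), sum(p.values())


queries = list(range(20))   # candidate query locations 0..19
tuples = [2, 3, 15]         # tuple locations
mean, total_p = expected_count_estimate(queries, tuples, d_max=2)
```

Here only 11 of the 20 query locations return a tuple, so the selection probabilities sum to 11/20, yet the expected estimate is still exactly 3, the true COUNT. The argument requires every tuple to be returned by at least one query location.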

The second constraint we have observed from real-world LBS is
a more complex ranking function that involves not only the distance
between the query location and a tuple but also other factors such
as the static rank of certain attributes of the tuple. Google Places
API is an example here, as it allows ranking by “prominence”, which
takes into account both distance and tuple popularity.⁵

For this constraint, the applicability of our results is no longer
straightforward. The key challenge here is that the area returning a
tuple may become segregated across many disjoint regions, making
it extremely difficult to compute the sampling probability p(t)
for a tuple. To understand why, consider an example where tuples
are ranked according to the sum of two scores. One is a distance
score, which awards a higher score to a closer tuple but 0 to
tuples more than 50 miles away. The other is a static score such
as popularity. What might happen here is that the most popular
tuple (in the bounding region, say the US) is returned by queries at
all places without a tuple within 50 miles (say the middle of a
desert in Nevada). Clearly, it becomes extremely difficult to
enumerate all the disjoint regions that return this tuple.

4 http://open.weibo.com/wiki/2/place/nearby/users

5 Note that Google Places API also supports traditional distance-based
ranking, enabling the direct usage of our algorithms.

Fortunately, for LR-LBS in practice, it is still highly likely that
our LR-LBS-AGG algorithm can handle the constraint, because the
algorithm works properly as long as the nearest neighbor is always
included in the top-k results. Since an LR-LBS returns tuple
locations, we can always post-process the query answer to obtain the
nearest neighbor according to distance, and then apply our
algorithm. Given that k ≫ 1 in real-world LBS, we anticipate
a near-certain probability for the nearest neighbor to be included in
the top-k results, thus enabling LR-LBS-AGG.
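
A minimal sketch of this post-processing step follows. It assumes each returned tuple carries planar coordinates in hypothetical fields `x` and `y`, and that the interface orders results by some non-distance score. It is an illustration, not the code used in the experiments.

```python
import math


def nearest_neighbor(query, results):
    """Pick the tuple closest to `query` from a top-k answer that may be
    ranked by something other than distance (e.g., prominence).

    `query` is an (x, y) pair; each result is a dict with hypothetical
    keys "id", "x", and "y". Returns None for an empty answer.
    """
    if not results:
        return None
    return min(
        results,
        key=lambda r: math.hypot(r["x"] - query[0], r["y"] - query[1]),
    )


# Answer ordered by "prominence": the popular but distant tuple comes first.
answer = [
    {"id": "popular", "x": 40.0, "y": 0.0},
    {"id": "corner-cafe", "x": 0.5, "y": 0.5},
    {"id": "diner", "x": 3.0, "y": 4.0},
]
closest = nearest_neighbor((0.0, 0.0), answer)
```

The result is the true nearest neighbor only if that tuple appears somewhere in the top-k answer, which is the condition stated above. For latitude/longitude data, a great-circle distance would replace `math.hypot`.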

5.4 Extension to Higher Dimensions 

While LBS in practice is mostly confined to 2D, we would like
to point out here (if only for theoretical interest) that our
algorithms readily apply to kNN queries over higher-dimensional data
where Euclidean distance is used as the ranking function.
Specifically, note that for LR-LBS, Theorem 1 holds no matter what
dimensionality the tuple locations have - as a higher-dimensional
Voronoi cell computed from a subset of tuples still completely
encompasses the real one. Similarly, all the optimizations discussed
in §3.2 readily apply as well. For LNR-LBS-AGG, the only change
required is in the binary search process: instead of finding the
perpendicular bisecting line between two tuples as in the 2D case, we
now need to find the perpendicular bisecting (d-1)-dimensional
hyperplane in the d-dimensional case. Correspondingly, each vertex of
the d-dimensional Voronoi
cell is now the interception of ( d 2 ) such (d— 1 )-dimensional planes. 

In other words, we still only need two vertices of the Voronoi cell 
to derive a tuple’s location in LNR-LBS - enabling the usage of 
LNR-LBS-AGG. 
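
To make the higher-dimensional generalization concrete, the sketch below computes the perpendicular bisecting hyperplane of two points in d dimensions. It assumes Euclidean distance and represents the hyperplane as a pair (normal, offset); the function names are hypothetical and the code is an illustration only.

```python
def bisecting_hyperplane(a, b):
    """Perpendicular bisecting hyperplane of points a and b in d dimensions.

    Returns (normal, offset) such that the hyperplane is the set of
    points x with dot(normal, x) == offset.
    """
    normal = [bi - ai for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    return normal, offset


def side(x, normal, offset):
    """Negative if x is closer to a, positive if closer to b, 0 if equidistant."""
    return sum(n * xi for n, xi in zip(normal, x)) - offset


# 3D example: the bisector of (0, 0, 0) and (2, 0, 0) is the plane x = 1.
normal, offset = bisecting_hyperplane((0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
```

In 2D the same function returns the familiar bisecting line, so the 2D procedure is the d = 2 special case.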

6. EXPERIMENTAL RESULTS

6.1 Experimental Setup

Hardware and Platform: All our experiments were performed on
a quad-core 2.5 GHz Intel i7 machine running Ubuntu 14.10 with
16 GB of RAM. The algorithms were implemented in Python.

Offline Real-World Dataset: To verify the correctness of our
results, we started by testing our algorithms locally over
OpenStreetMap [4], a real-world spatial database consisting of POIs
(including restaurants, schools, colleges, banks, etc.) from
public-domain data sources and user-created data.

We focused on the USA portion of OpenStreetMap. To enrich
the SUM/COUNT/AVG aggregates for testing, we extended the
attributes of POIs (specifically, restaurants and schools) by
“JOINing” OpenStreetMap with two external data sources, Google Maps
[3] and US Census [1]. Specifically, we added to each (applicable)
restaurant POI its review ratings from Google Maps, and to each
school POI its enrollment number from US Census. The US Census
data is also used as the (optional) external knowledge source
- i.e., to provide the population density data for the optimization
technique discussed in §5.

Note that we have complete access to the enriched dataset and
full control over its query interface. Thus, we implemented a kNN
interface with the ranking function being the Euclidean distance;
with returned attributes either including location (for testing
LR-LBS) or excluding it (for LNR-LBS); and with varying k to observe
the change in performance of our algorithms.

Online LBS Demonstrations: In order to showcase the efficacy
of our algorithms in real-world applications, we also conducted
experiments online over 3 very popular real-world LBS: Google
Maps [3], WeChat [6], and Sina Weibo [5]. Each of these services
has at least hundreds of millions of users. Unlike in the offline
experiments, we do not have direct access to the ground-truth
aggregates due to the lack of partnership with these LBS.
Nonetheless, we did attempt to verify the accuracy of our aggregate
estimations with information provided by external sources (e.g.,
news reports) - more details later in the section.

In online experiments for LR-LBS, we used Google Maps,
specifically its Google Places API [3], which takes as input a query
location (latitude/longitude pair) and (optionally) filtering
conditions such as keywords (e.g., “Starbucks”) or POI type (e.g.,
“restaurant”), and returns at most k = 60 POIs nearby, ordered by
distance from low to high, with location and other relevant
information (e.g., review ratings) returned for each POI.

For LNR-LBS, we tested WeChat and Sina Weibo using their
respective Android apps. Both directly fetch locations from the OS
positioning service and search for nearby users, with WeChat
returning at most k = 50 and Sina Weibo returning k = 100 nearest
users. Unlike Google Maps, these two services do not return the
exact locations of these nearby users - but only provide attributes
such as name, gender, etc.

An implementation-related issue regarding WeChat is that, unlike
its mobile apps, its API does not support nearest-neighbor
search. Thus, we conducted our experiments by running its Android
app (with support for nearest-neighbor search) on the official
Android emulator, and used the debugging feature of location
spoofing to issue queries from different locations. We then used
the Monkey Runner tool⁶ for the Android emulator to interact with the
app - i.e., sending queries and receiving results. Specifically, to
extract query answers from the Android emulator, we first took a
screenshot of the query-answer screen, and then parsed the results
through an OCR engine, tesseract-ocr.⁷

Algorithms Evaluated: We mainly evaluated three algorithms in
our experiments: LR-LBS-AGG and LNR-LBS-AGG from §3 and
§4 respectively, along with the only existing solution for LR-LBS
(note there is no existing solution for LNR-LBS), which we refer
to as LR-LBS-NNO [10]. LR-LBS-NNO has a number of tunable
parameters - we picked the parameter settings and optimizations
from [10] that provided the best performance. We also tested
variants of our algorithms that lack certain variance-reduction
techniques discussed in §3 and the weighted sampling, in order to
demonstrate the effectiveness of these techniques.

Performance Measures: As discussed in §2, we measure efficiency
through query cost, i.e., the number of queries issued to
the LBS. Our estimation accuracy is measured experimentally by
relative error. Each data point is obtained as the average of 25 runs.
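
As a concrete reading of this accuracy measure, the sketch below computes the relative error of one estimate and averages it over repeated runs. It assumes relative error is defined as |estimate - truth| / |truth| and that the per-run errors (rather than the estimates) are averaged; both are assumptions made for illustration, and the run values are made up.

```python
from statistics import mean


def relative_error(estimate, truth):
    """Relative error of a single estimate against the ground truth."""
    return abs(estimate - truth) / abs(truth)


def mean_relative_error(estimates, truth):
    """Average relative error over independent runs of an estimator."""
    return mean([relative_error(e, truth) for e in estimates])


# Three hypothetical runs against a ground-truth COUNT of 1000.
runs = [950.0, 1100.0, 1000.0]
avg_error = mean_relative_error(runs, 1000.0)
```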

6.2 Experiments over Real-World Datasets 

Unbiasedness of Estimators: Our first experiment seeks to show
the unbiasedness of our estimators for LR-LBS-AGG and
LNR-LBS-AGG even after incorporating the various error reduction
strategies. LR-LBS-NNO is known to be unbiased from [10] after an
expensive bias correction step. Figure 12 shows a trace of the three
algorithms when estimating the COUNT of all restaurants in the US,
plotting the current estimate periodically after a fixed number of
queries has been issued to the LBS. We can see that LR-LBS-NNO has a
high variance and takes significantly longer to converge, while our
estimators converge to the ground truth much earlier than
LR-LBS-NNO. This indicates that the error reduction techniques
successfully reduce the variance of our estimators.

6 http://developer.android.com/tools/help/monkeyrunner_concepts.html

7 https://code.google.com/p/tesseract-ocr/

Table 1: Summary of Online Experiments

Figure 11: Voronoi Decomposition of Starbucks in US

Query Cost versus Relative Error: We start by testing the key
tradeoff - i.e., query cost vs. relative error - for all three
algorithms over various aggregates. Specifically, Figures 14, 15, 16,
and 17 show the results for four queries: COUNT of schools in the US,
COUNT of restaurants in the US, SUM of school enrollments in the US,
and AVG of restaurant ratings in Austin, Texas, respectively. One
can see that not only does our LR-LBS-AGG algorithm significantly
outperform the previous LR-LBS-NNO [10] in all cases, but even our
algorithm for the LNR-LBS case achieves much better performance
than the previous algorithm (despite the lack of tuple locations in
query results).

Query Cost versus LBS Size: Figure 18 shows the impact of LBS
database size (in terms of number of POIs or users) on the query cost
to estimate the COUNT of schools in the US for a fixed relative error
of 0.1. We varied the database size by picking a subset of the
database (such as 25%, 50%, etc.) uniformly at random and estimating
the aggregate over it. As expected for a sampling-based approach, the
increase in database size does not have any major impact and only
results in a slight increase in overall query cost (due to the more
complex topology of Voronoi cells).

Query Cost versus k: Figure 19 shows how the value of k (the
number of tuples returned by the kNN interface) affects the query
cost. Again, we measure the query cost required to achieve a
relative error of 0.1 on the aggregate COUNT of schools in the US. We
compared a variant that leverages our variance reduction strategy,
which adaptively decides which subset of tuples (i.e., h of the
top-k) to use, with a fixed variant that uses all the top-k tuples.
As expected, our adaptive strategy has a lower query cost and
consistently achieves a saving of 10% of the query cost.

Efficacy of Error Reduction Strategies: We started by verifying
the effectiveness of weighted sampling using external knowledge.
Figure 13 compares the performance of the two sampling strategies
- uniform and weighted - while estimating the COUNT of schools
in the US. One can see that the weighted sampling variants result in
significant savings in query cost.

In our final set of experiments, we evaluated the efficacy of the
various error reduction strategies we described in the paper. We
compared 5 different variants of our algorithm for LR-LBS, ranging
from no error reduction strategies (LR-LBS-AGG-0) to sequentially
adding them one by one in the order discussed in the paper,
culminating in LR-LBS-AGG, which incorporates all of them.
Figure 20 shows the results of this experiment. As expected, the first
two strategies of faster initialization and leveraging history caused
a significant reduction in query cost. We observed that the results
for LNR-LBS were very similar.

6.3 Online Demonstrations 

Google Places: Our first online demonstration of LR-LBS-AGG
was on the Google Places API, estimating two aggregates with
different selection conditions. The first involves selection
conditions that can be passed over to the LBS (COUNT of Starbucks in
the US), while the second involves a selection condition that cannot
be passed over (see §5 for discussion), such as COUNT of
restaurants in Austin, Texas that are open on Sundays.

Table 1 shows the results of the experiments. We also verified the
accuracy of our estimates for the first aggregate (COUNT of Starbucks)
through the public release of Starbucks Corp [2]. One can see from
the table that, with just 5000 queries, LR-LBS-AGG achieves very
accurate estimations (< 5% relative error) for the count.

To provide an intuitive illustration of the execution of our
algorithm, we also continued the estimation of COUNT(“Starbucks”)
until enumerating all Starbucks in the US. Figure 11 demonstrates
the Voronoi diagram constructed by our algorithm at the end. One
can see the vastly different sizes of Voronoi cells - spanning
hundreds of thousands of km² in rural areas and less than 1 km² in
urban cities, justifying the effectiveness of weighted sampling.

WeChat and Sina Weibo: We estimated two aggregates, (1) total
number of users and (2) gender ratio, over two LNR-LBS, WeChat
and Sina Weibo, respectively. Table 1 shows the results of the
experiments. One can observe from the table that our estimations
quickly converge to a narrow range (+/- 5%) after issuing a small
number of queries (10000). While we do not have access to the
ground truth this time, we do note an interesting observation from
our results: the percentage of male users is much higher on WeChat
than on Sina Weibo - an observation verified by various surveys in
China [7]. We would like to note that the COUNT aggregate measures
the number of users who have enabled the location feature of
WeChat and Weibo respectively, and is different from the number
of registered or active accounts.

Localization Accuracy: As a final set of experiments, we also
evaluated the effectiveness of our tuple position computation
approaches in tracking real-world users. Specifically, we sought to
identify the precise location of static objects located across the
region. We conducted this experiment over Google Places in the US and
WeChat in China. We treated Google Places as an LNR-LBS by ignoring
the location provided by its API. We sought to identify the location
of 200 randomly chosen POIs after issuing at most 100 queries
for each POI. For WeChat, we positioned our user at 200 diverse
locations within China (typically in urban places) and sought to
identify the location. Since the precise location of the POI/user is
known, we can compute the distance between the actual and estimated
positions. Figure 21 shows the result of the experiments. The
results show that more than 80% of the POIs were located within 20m
of the exact location and every POI was located within a distance of
75m. Due to the various location obfuscation strategies employed
by WeChat, we achieved an accuracy of 50m or lower only 45% of the
time. We were still able to locate the user within 100m almost
all the time. While our theoretical methods could precisely identify
the location, the discrepancy in the real world occurs due to various
external factors such as obfuscation, coverage/localization limits, etc.

7. RELATED WORK

Analytics and Inference over LBS: Location based services (LBS)
such as map services (Google Maps) and location based social
networks (such as Foursquare, WeChat, Sina Weibo) have become
popular in recent years. The prior work on analytics over LBS
focused exclusively on the LR-LBS scenario. The closest prior
work is [10], which seeks to estimate COUNT and SUM aggregates
over LR-LBS using a nearest neighbor oracle. It then corrects
the bias by using the area of the Voronoi cell, using an approach
that is very expensive. Aggregate estimation over LBS such as
Foursquare that do not provide a nearest neighbor oracle interface
could be done using [19, 27]. [19] proposed a random region
sampling method with an unknown estimation bias that could be
eliminated using techniques from [27]. However, none of them work
for LNR-LBS. There has been work on inferring the location and
other private information of users of LBS. [18] proposed
trilateration based methods to infer the location of users even when
the LBS only provided relative distances. There has been other
extensive work [16, 23, 25, 28] on inferring location information and
re-identification of users, although none of them are applicable to
the LBS models studied in this paper.

Aggregate Estimations over Hidden Web Repositories: There
has been a number of prior works on performing aggregate
estimation over static hidden databases. [?] provided an unbiased
estimator for COUNT and SUM aggregates for static databases with
form based interfaces. [?] describe efficient techniques
to obtain random samples from hidden web databases that can then
be utilized to perform aggregate estimation. Recent works such
as [20, 26] propose more sophisticated sampling techniques so as
to reduce the variance of the aggregate estimation. For hidden
databases with keyword interfaces, prior work has studied
estimating the size of search engines [?] or a corpus [?].


8. FINAL REMARKS 

In this paper, we explored the problem of aggregate estimation
over location based services, which are increasingly popular. We
introduced a taxonomy of LBS with a kNN query interface based on
whether the location of the tuple is returned (LR-LBS) or not
(LNR-LBS). For the former, we proposed an efficient algorithm and
various error reduction strategies that outperform prior work. We
initiated the study of the latter by proposing effective algorithms
for aggregate estimation and for inferring the position of a tuple to
arbitrary precision, which might be of independent interest. We
verified the effectiveness of our algorithms through a comprehensive
set of experiments on a large real-world geographic dataset and
online demonstrations on high-profile real-world websites.

APPENDIX

A. BINARY SEARCH PROCESS 

Design of Binary Search: Given the half-line $\ell$ from $c_1$
passing through $c_2$, we conduct the binary search as follows.
First, we find $c_b$, the intersection of this half-line with the
bounding box. Then, we perform a binary search between $c_1$ and
$c_b$ to find a segment of the half-line with length at most
$\delta$, say with two ends being $c_3$, $c_4$ (with the distance
between $c_3$ and $c_4$ at most $\delta$), such that while $c_3$
returns $t$, $c_4$ returns another tuple, say $t'$. This step takes
at most $\log(b/\delta)$ queries, where $b$ is the perimeter of the
bounding box.

Then, we consider two half-lines $\ell_1$ and $\ell_2$, both of which
start from $c_1$ and form an angle of $-\arcsin(\delta'/r)$ and
$+\arcsin(\delta'/r)$ with $\ell$, respectively, where $\delta'$ is a
pre-determined (small) threshold and $r$ is the distance between
$c_1$ and $c_4$. For each $\ell_i$, we perform the above binary
search process to find a segment of length at most $\delta$ that
returns $t$ on one end and $t'$ on the other. Note that such a
process might fail - e.g., there might be no point on $\ell_i$ which
returns $t'$. We set two rules to address this situation: First, we
terminate the search for $\ell_i$ if we have reached a segment
shorter than $\delta$, with one end returning $t$ and the other
returning a tuple other than $t'$. Second, we move on to the next
step as long as (at least) one of $\ell_1$ and $\ell_2$ gives us a
satisfactory segment of length at most $\delta$. If neither can
produce the segment, we terminate the entire process and output the
following (estimated) Voronoi edge: the perpendicular bisector of
$(c_3, c_4)$.

Now suppose that $\ell_1$ produces a satisfactory segment of length
at most $\delta$. Let this segment be $(c_5, c_6)$. We simply return
our (estimated) Voronoi edge as the line that passes through: (1) the
midpoint of $(c_3, c_4)$, and (2) the midpoint of $(c_5, c_6)$. One
can see that the overall query cost of the binary search process is
at most $3\log(b/\delta)$.

Algorithm 7 provides the pseudocode for the binary search process.


Algorithm 7 Binary-Search

Input: Tuple $t$, locations $c_1$, $c_2$ where query($c_1$) returns $t$
Output: An edge of $V(t)$

$c_b$ = intersection of half-line $(c_1, c_2)$ with bounding box
Find $c_3$, $c_4$ s.t. dist($c_3$, $c_4$) $\le \delta$ and query($c_3$) $\ne$ query($c_4$)
$r$ = dist($c_1$, $c_4$)
Construct half-lines $\ell_1$, $\ell_2$ from $c_1$ with angles $\pm\arcsin(\delta'/r)$
$(c_5, c_6)$ = line segment on $\ell_1$ or $\ell_2$ with dist($c_5$, $c_6$) $\le \delta$
    and query($c_5$) $\ne$ query($c_6$)
if none exists, return perpendicular bisector of $(c_3, c_4)$
else return line passing through midpoints of $(c_3, c_4)$ and $(c_5, c_6)$
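
The first search step of the procedure (locating $c_3$ and $c_4$ on the half-line) can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments: `query` stands for a hypothetical top-1 oracle that returns a tuple identifier for a location, the toy oracle and its two sites are made up, and the search assumes `query(c1)` returns `t`.

```python
import math


def boundary_segment(query, t, c1, cb, delta):
    """Binary search on segment (c1, cb) for two points at most `delta`
    apart such that the first returns `t` and the second does not.

    Returns (c3, c4, queries_used), or None if `cb` also returns `t`.
    """
    if query(cb) == t:
        return None  # no edge of V(t) is crossed before the bounding box
    lo, hi = c1, cb  # invariant: query(lo) == t and query(hi) != t
    used = 1
    while math.dist(lo, hi) > delta:
        mid = tuple((a + b) / 2 for a, b in zip(lo, hi))
        used += 1
        if query(mid) == t:
            lo = mid
        else:
            hi = mid
    return lo, hi, used


# Toy top-1 oracle over two sites; the true Voronoi edge is the line x = 1.1.
sites = {"t": (0.0, 0.0), "u": (2.2, 0.0)}


def toy_query(point):
    return min(sites, key=lambda name: math.dist(sites[name], point))


c3, c4, used = boundary_segment(toy_query, "t", (0.0, 0.0), (4.0, 0.0), 0.01)
```

Each iteration halves the segment, so the number of queries grows with the logarithm of the ratio between the initial segment length and $\delta$. The same routine can be reused on the two rotated half-lines to obtain $(c_5, c_6)$.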


Error Bound on Edge Estimation: We have the following theorem
on the error bound of this binary search process:

THEOREM 3. For a given tuple $t$ and query location $c_1$ which
returns $t$, for any other location $c_2$, the Voronoi cell of $t$
must have an edge $\ell_V$ that intercepts half-line $(c_1, c_2)$
such that the maximum edge error for estimating $\ell_V$ satisfies

$$\epsilon \le \max(2\delta', b \cdot \sin(\arctan(\delta/\delta'))). \qquad (6)$$

In other words, for every point $p \in \ell_V$, there exists a point
$p'$ on our estimated Voronoi edge $\ell'_V$ generated from
$(c_1, c_2)$ (i.e., $p' \in \ell'_V$), such that

$$d(p, p') \le \max(2\delta', b \cdot \sin(\arctan(\delta/\delta'))), \qquad (7)$$

where $d(\cdot, \cdot)$ is the Euclidean distance between two points.
In addition, for every vertex $v$ of $\ell_V$, if line segment
$(t, v)$ intercepts $\ell'_V$, then the interception point $v'$ must
satisfy

$$d(t, v) - d(t, v') \le \max(2\delta', b \cdot \sin(\arctan(\delta/\delta'))). \qquad (8)$$

A simple observation from the theorem is that the binary search
process can reach an arbitrary precision level - i.e., for any given
upper bound on $d(p, p')$, say $d_U$, we can set $\delta' = d_U/2$ and

$$\delta \le \frac{d_U}{2} \tan\left(\arcsin\frac{d_U}{b}\right) \qquad (9)$$

to satisfy the bound. Since both tan and arcsin can be bounded
from both sides by a polynomial of their input (through Taylor
expansion), one can see that the corresponding query complexity is
$O(\log(b/d_U))$, leading to the following corollary on the maximum
edge error defined in §3.


COROLLARY 1. The query cost required for achieving a maximum
edge error of $\epsilon$ is $O(\log(b/\epsilon))$ - i.e.,
$O(\log(1/\epsilon))$ when we consider the bounding box size $b$ to
be constant.
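
The parameter choice behind this corollary can be spelled out as follows. This is a sketch that starts from the bound $\max(2\delta', b \sin(\arctan(\delta/\delta')))$ of Theorem 3 and a target error $d_U$ with $0 < d_U < b$; the small-error approximation in the last step is added here for illustration.

```latex
\begin{align*}
2\delta' \le d_U &\iff \delta' \le d_U/2, \\
b\,\sin\!\big(\arctan(\delta/\delta')\big) \le d_U
  &\iff \delta \le \delta'\,\tan\!\big(\arcsin(d_U/b)\big)
   = \frac{\delta'\, d_U}{\sqrt{b^2 - d_U^2}} .
\end{align*}
% With \delta' = d_U/2 and d_U \ll b, the largest admissible \delta is about d_U^2/(2b), so
\[
3\log(b/\delta) \approx 3\log\!\big(2b^2/d_U^2\big)
  = 6\log(b/d_U) + 3\log 2 = O(\log(b/d_U)) .
\]
```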


Error Bound on Voronoi Cell Volume Estimation: A direct
corollary from Theorem 3 is an error bound on the estimated volume of
a Voronoi cell. Note from our design of LNR-LBS-AGG that our
estimated Voronoi cell is always a subregion of the real one. This,
in combination with (8) in Theorem 3, leads to the following
corollary.


COROLLARY 2. For a given tuple $t$, the ratio between the volume
of the estimated Voronoi cell $V'(t)$ and the real one $V(t)$
satisfies

$$\cdots \le \frac{|V'(t)|}{|V(t)|} \le 1, \qquad (10)$$

where $d$ is the nearest distance between $t$ and another tuple in
the database, and $\epsilon$ is the maximum edge error.




